Lectica basics for schools

If you are a school leader, this post is for you. Here, you'll find information about Lectica, its mission, and our first electronically scored Lectical Assessment—the LRJA.

Background

Lectica, Inc. is a 501(c)(3) charitable corporation. Its mission is to build and deliver learning tools that help students develop skills for thinking and learning. These learning tools are backed by a strong learning model—the Virtuous Cycle of Learning (VCoL+7™)—and a comprehensive vision for educational testing and learning, which you can learn more about in our white paper, Virtuous cycles of learning: Redesigning testing during the digital revolution.

We have spent over 20 years developing our methods and the technology required to deliver our learning tools—known as Lectical™ Assessments or DiscoTests®—at scale. These assessments are backed by a large body of research, including ongoing investigations of their validity and reliability. Here are some links to research reports:

The following video provides an overview of our research and mission:

Current offerings

In the fall of 2016, we introduced our first electronically scored Lectical Assessment—the LRJA (an assessment of reflective judgment/critical thinking). The LRJA can be used in research and program evaluation as a summative assessment, or in the classroom as a formative assessment—or both.

The best way to learn about the LRJA is to experience it first-hand at lecticalive. Just click on this link, then select the "go straight to the demo" button. On the next page, fill in the sign-up form with the educational level of your choice. Click "submit", then click on the "autofill" button (top right, under the header) to fill the responses form with an example.

If you're interested in working with the LRJA or would like to learn more about using Lectical Assessments to optimize thinking and learning, please contact us.

Adaptive learning, big data, and the meaning of learning

Knewton defines adaptive learning as "A teaching method premised on the idea that the curriculum should adapt to each user." In a recent blog post, Knewton's COO, David Liu, expanded on this definition. Here are some extracts:

You have to understand and have real data on content… Is the instructional content teaching what it was intended to teach? Is the assessment accurate in terms of what it’s supposed to assess? Can you calibrate that content at scale so you’re putting the right thing in front of a student, once you understand the state of that student? 

On the other side of the equation, you really have to understand student proficiency… understanding and being able to predict how that student is going to perform, based upon what they’ve done and based upon that content that I talked about before. And if you understand how well the student is performing against that piece of content, then you can actually begin to understand what that student needs to be able to move forward.

The idea of putting the right thing in front of a student is very cool. That's part of what we do here at Lectica. But what does Knewton mean by learning?

Curiosity got the better of me, so I set out to do some investigating. 

What does Knewton mean by learning?

In Knewton's white paper on adaptive learning, the authors do a great job of describing how their technology works.

To provide continuously adaptive learning, Knewton analyzes learning materials based on thousands of data points — including concepts, structure, difficulty level, and media format — and uses sophisticated algorithms to piece together the perfect bundle of content for each student, constantly. The system refines recommendations through network effects that harness the power of all the data collected for all students to optimize learning for each individual student.

They go on to discuss several impressive technological innovations. I have to admit, the technology is cool, but what is their learning model and how is Knewton's technology being used to improve learning and teaching?

Unfortunately, Knewton does not seem to operate with a clearly articulated learning model in mind. In any case, I couldn't find one. But based on the sample items and feedback examples shown in their white paper and on their site, what Knewton means by learning is the ability to consistently get right answers on tests and quizzes, and the way to learn (get more answers right) is to get more practice on the kind of items students are not yet consistently getting right.

In fact, Knewton appears to be a high tech application of the content-focused learning model that's dominated public education since No Child Left Behind—another example of what it looks like when we throw technology at a problem without engaging in a deep enough analysis of that problem.

We're in the middle of an education crisis, but it's not because children aren't getting enough answers right on tests and quizzes. It's because our efforts to improve education consistently fail to ask the most important questions, "Why do we educate our children?" and "What are the outcomes that would be genuine evidence of success?"

Don't get me wrong. We love technology, and we leverage it shamelessly. But we don't believe technology is the answer. The answer lies in a deep understanding of how learning works and what we need to do to support the kind of learning that produces outcomes we really care about. 

 

What every buyer should know about forms of assessment

In this post, I'll be describing and comparing three basic forms of assessment—surveys, tests of factual and procedural knowledge, and performative tests.

Surveys—measures of perception, preference, or opinion

What is a survey? A survey (a.k.a. inventory) is any assessment that asks the test-taker to choose from a set of options, such as "strongly agree" or "strongly disagree", based on opinion, preference, or perception. Surveys can be used by organizations in several ways. For example, opinion surveys can help maintain employee satisfaction by providing a "safe" way to express dissatisfaction before workplace problems have a chance to escalate.

Surveys have been used by organizations in a variety of ways. Just about everyone who's worked for a large organization has completed a personality inventory as part of a team-building exercise. The results stimulate lots of water cooler discussions about which "type" or "color" employees are, but their impact on employee performance is unclear. (Fair warning: I'm notorious for my discomfort with typologies!) Some personality inventories are even used in high stakes hiring and promotion decisions, a practice that continues despite evidence that they are very poor predictors of employee success [1].

Although most survey developers don't pretend their assessments measure competence, many do. The item on the left was used in a survey with the words "management skills" in its title.

Claims that surveys measure competence are most common when "malleable traits"—traits that are subject to change, learning or growth—are targeted. One example of a malleable trait is "EQ" or "emotional intelligence". EQ is viewed as a skill that can be developed, and there are several surveys that purport to measure its development. What they actually measure is attitude.

Another example of surveys masquerading as assessments of skill is in the measurement of "transformational learning". Transformational learning is defined as a learning experience that fundamentally changes the way a person understands something, yet the only way it appears to be measured is with surveys. Transformational learning surveys measure people's perceptions of their learning experience, not how much they are actually changed by it.

The only survey-type assessments that can be said to measure something like skill are assessments—such as 360s—that ask people about their perceptions. Although 360s inadvertently measure other things, like how much a person is liked or whether or not a respondent agrees with that person, they may also document evidence of behavior change. If what you are interested in is behavior change, a 360 may be appropriate in some cases, but it's important to keep in mind that while a 360 may measure change in a target's behavior, it's also likely to measure change in a respondent's attitude that's unrelated to the target's behavior.

360-type assessments may, to some extent, serve as tests of competence, because behavior change may be an indication that someone has learned new skills. When an assessment measures something that might be an indicator of something else, it is said to measure a proxy. A good 360 may measure a proxy (perceptions of behavior) for a skill (competence).

There are literally hundreds of research articles that document the limitations of surveys, but I'll mention only one more of them here: All of the survey types I've discussed are vulnerable to "gaming"—smart people can easily figure out what the most desirable answers are.

Surveys are extremely popular today because, relative to assessments of skill, they are inexpensive to develop and cost almost nothing to administer. Lectica gives away several high quality surveys for free because they are so inexpensive, yet organizations spend millions of dollars every year on surveys, many of which are falsely marketed as assessments of skill or competence.

Tests of factual and procedural knowledge

A test of competence is any test that asks the test taker to demonstrate a skill. Tests of factual and procedural knowledge can legitimately be thought of as tests of competence.

The classic multiple choice test examines factual knowledge, procedural knowledge, and basic comprehension. If you want to know if someone knows the rules, which formulas to apply, the steps in a process, or the vocabulary of a field, a multiple choice test may meet your needs. Often, the developers of multiple choice tests claim that their assessments measure understanding, reasoning, or critical thinking. This is because some multiple choice tests measure skills that are assumed to be proxies for skills like understanding, reasoning, and critical thinking. They are not direct tests of these skills.

Multiple choice tests are widely used because there is a large industry devoted to making them, but they are increasingly unpopular because of their (mis)use as high stakes assessments. They are often perceived as threatening and unfair because they are used to rank or select people, and they offer little help to the individual learner. Moreover, their relevance is often brought into question because they don't directly measure what we really care about—the ability to apply knowledge and skills in real-life contexts.

Performative tests

Tests that ask people to directly demonstrate their skills in (1) the real world, (2) real-world simulations, or (3) as they are applied to real-world scenarios are called performative tests. These tests usually do not have "right" answers. Instead, they employ objective criteria to evaluate performances for the level of skill demonstrated, and often play a formative role by providing feedback designed to improve performance or understanding. This is the kind of assessment you want if what you care about is deep understanding, reasoning skills, or performance in real-world contexts.

Performative tests are the most difficult tests to make, but they are the gold standard if what you want to know is the level of competence a person is likely to demonstrate in real-world conditions—and if you're interested in supporting development. Standardized performative tests are not yet widely used, because the methods and technology required to develop them are relatively new, and there is not yet a large industry devoted to making them. But they are increasingly popular because they support learning.

Unfortunately, performative tests may initially be perceived as threatening because people's attitudes toward tests of knowledge and skill have been shaped by their exposure to high stakes multiple choice tests. The idea of testing for learning is taking hold, but changing the way people think about something as ubiquitous as testing is an ongoing challenge.

Lectical Assessments

Lectical Assessments are performative tests—tests for learning. They are designed to support robust learning—the kind of learning that optimizes the growth of essential real-world skills. We're the leader of the pack when it comes to the sophistication of our methods and technology, our evidence base, and the sheer number of assessments we've developed.

[1] Frederick P. Morgeson, et al. (2007) Are we getting fooled again? Coming to terms with limitations in the use of personality tests for personnel selection, Personnel Psychology, 60, 1029-1033.

A new kind of report card

When I was a kid, the main way school performance was measured was with letter grades. We got letter grades on almost all of our work. Getting an A meant you knew it all, a B meant you didn't quite know it all, a C meant you knew enough to pass, a D meant you knew so little you were on the verge of failing, and an F meant you failed. If you always got As you were one of the really smart kids, and if you always got Ds and Fs you were one of the dumb kids. Unfortunately, that's how we thought about it, plain and simple.

If I got a B, my teacher and parents told me I could do better and that I should work harder. If I got a C, I was in deep trouble, and was put on restriction until I brought my grade up. This meant more hours of homework. I suspect this was a common experience. It was certainly what happened on Father Knows Best and The Brady Bunch.

The best teachers also commented on our work, telling us where we could improve our arguments or where and how we had erred, and suggesting actions we could take to improve. In terms of feedback, this was the gold standard. It was the only way we got any real guidance about what we, as individuals, needed to work on next. Letter grades represented rank, punishment, and reward, but they weren't very useful indicators of where we were in our growth as learners. Report cards were for parents. 

Usher in Lectica and DiscoTest

One of our goals here at Lectica has been to make possible a new kind of report card—one that:

  1. delivers scores that have rich meaning for students, parents, and decision-makers,
  2. provides the kind of personal feedback good teachers offer, and
  3. gives students an opportunity to watch themselves grow.

This new report card—illustrated on the right—uses a single learning "ruler" for all subjects, so student growth in different subjects can be shown on the same scale. In the example shown here, each assessment is represented by a round button that links to an explanation of the student's learning edge at the time the assessment was taken.

This new report card also enables direct comparisons between growth trajectories in different subject areas. 

An additional benefit of this new report card is that it delivers a rich portfolio-like account of student growth that can be employed to improve admissions and advancement decisions. 

And finally, we're very curious about the potential psychological benefits of allowing students to watch how they grow. We think it's going to be a powerful motivator.

 

The dark? side of Lectical Assessment

Recently, members of our team at Lectica have been discussing potential misuses of Lectical Assessments, and exploring the possibility that they could harm some students. These are serious concerns that require careful consideration and discussion, and I urge readers to pitch in.

One of the potential problems we've discussed is the possibility that students will compare their scores with one another, and that students with lower scores will suffer from these comparisons. Here's my current take on this issue.

Students receive scores all the time. By third grade they already know their position in the class hierarchy, and live every day with that reality. Moreover, despite the popular notion that all students can become above average if they work hard enough, average students don't often become above average students, which means that during their entire 12 years of schooling, they rarely receive top rewards (the best grades) for the hard work they do. In fact, they often feel like they're being punished even when they try their best. To make things worse, in our current system they're further punished by being forced to memorize content they haven't been prepared to understand, a problem that worsens year by year.

Lectica's approach to assessment can't prevent students from figuring out where their scores land in the class distribution, but we can give all students an opportunity to see themselves as successful learners, no matter where their scores are in that distribution. Average or below average students may still have to live with the reality that they grow at different rates than some of their peers, but they'll be rewarded for their efforts, just the same.

I've been told by some very good teachers that it is unacceptable to use the expression "average student." While I share the instinct to protect students from the harm that can come from labels, I don't share the belief that being an average student is a bad thing. Most of us were average students—or to be more precise, 68% of us were within one standard deviation of the mean. How did being a member of the majority become a bad thing?  And what harm are we doing to students by creating the illusion that we are all capable of performing above the mean?
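For the record, the 68% figure assumes the usual normal ("bell curve") model of a class distribution; here is a quick check using only the Python standard library.

```python
from statistics import NormalDist

# Share of a normal distribution within one standard deviation of the mean.
print(f"{NormalDist().cdf(1) - NormalDist().cdf(-1):.1%}")  # 68.3%
```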

I don't think we hurt children by serving up reality. We hurt them when we mislead them by telling them they can all be above average, or when we make them feel hopeless by insisting that they all learn at the same pace, then punishing them when they can't keep up.

I'm not saying it's not possible to raise the average. We do it by meeting the specific learning needs of every student and making sure that learning time is spent learning robustly. But we can't change the fact that there's a distribution, and we shouldn't pretend otherwise.

Lectical Assessments are tests, and are subject to the same abuses as other tests. But they have three attributes that help mitigate these abuses. First, they allow all students without severe disabilities to see themselves as learners. Second, they help teachers customize instruction to meet the needs of each student, so more kids have a chance to achieve their full potential. And finally, they reward good pedagogy—even in cases in which the assessments are being misused. After all, testing drives instruction.

Comparison of DiscoTests with conventional tests

DiscoTests and conventional standardized tests can be thought of as complementary. They are designed to test different kinds of skills, and research confirms that they are successful in doing so. Correlations between scores on the kind of developmental assessments made by DTS and scores on conventional multiple choice assessments are in the .40-.60 range. That means that somewhere between 16% and 36% of the kind of learning that is captured by conventional assessments is likely to overlap with the kind of learning that is captured by DiscoTests.
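For readers who want to check that arithmetic: the proportion of variance two measures share is the square of their correlation, so the endpoints of the range above work out to .40² = 16% and .60² = 36%. Here is a minimal sketch of that calculation, using only the two correlation values cited above.

```python
# Shared variance between two measures is the squared correlation (r^2).
for r in (0.40, 0.60):
    print(f"r = {r:.2f} -> shared variance = {r ** 2:.0%}")

# r = 0.40 -> shared variance = 16%
# r = 0.60 -> shared variance = 36%
```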

The comparison below looks at DiscoTests and conventional standardized tests on a number of dimensions.

Theoretical foundation
  DiscoTests: Cognitive developmental theory, Dynamic Skill Theory, Test theory
  Conventional tests: Test theory

Scale
  DiscoTests: Fischer’s Dynamic Skill Scale, an exhaustively researched general developmental scale and a member of a family of similar scales developed during the 20th century
  Conventional tests: Statistically generated scales, different for each test (though some tests are statistically linked)

Learning sequences
  DiscoTests: Empirical, fine-grained, and precise; calibrated to the Dynamic Skill Scale
  Conventional tests: Empirical, coarse-grained, and general

Primary item type
  DiscoTests: Open response
  Conventional tests: More or less sophisticated forms of multiple choice

Targeted skills
  DiscoTests: Reasoning with knowledge, knowledge application, making connections between new and existing knowledge, writing
  Conventional tests: Content knowledge, procedural knowledge

Content
  DiscoTests: Carefully selected “big ideas” and the concepts and skills associated with them
  Conventional tests: The full range of content specified in state standards for a given subject

Educative/formative
  DiscoTests: Yes. (1) Each DiscoTest focuses on ideas and skills central to K-12 curricula, (2) test questions require students to thoughtfully apply new knowledge and connect it with their existing knowledge, (3) students receive reports with targeted feedback and learning suggestions, and (4) teachers learn how student knowledge develops in general and on each targeted concept or skill.
  Conventional tests: Not really, though they increasingly claim to be

Embeddable in curricula
  DiscoTests: Yes, DiscoTests are designed to be part of the curriculum
  Conventional tests: No

Standardized
  DiscoTests: Yes, statistically, and calibrated to the skill scale
  Conventional tests: Yes, statistically only

Stakes
  DiscoTests: Low; selection decisions are based on performance patterns over time on many individual assessments
  Conventional tests: High; selection decisions are often based on single assessments

Ecological validity
  DiscoTests: Direct tests that focus on deepening and connecting knowledge about key concepts and ideas, while developing broad skills required in adult life, such as those needed for reasoning, communicating, and problem-solving
  Conventional tests: Tests of proxies; focus on the ability to detect correct answers

Statistical reliability
  DiscoTests: .91+ for a single age cohort (distinguishes 5-6 distinct levels of performance)
  Conventional tests: For high stakes tests, usually .95+ for a single age cohort (distinguishes 6-7 distinct levels of performance)
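A note on the statistical reliability entry: the "distinct levels of performance" figures are roughly what you get from the separation ("strata") index used in the Rasch measurement literature. The sketch below uses that formula as an assumption about how such figures are commonly derived; it is not a description of Lectica's exact procedure, and the precise count also depends on the shape of the score distribution.

```python
from math import sqrt

def strata(reliability: float) -> float:
    """Wright & Masters' strata index: roughly how many statistically
    distinct levels of performance a test of this reliability can separate."""
    g = sqrt(reliability / (1 - reliability))  # person separation index
    return (4 * g + 1) / 3

for r in (0.91, 0.95):
    print(f"reliability {r:.2f} -> about {strata(r):.1f} distinguishable levels")

# reliability 0.91 -> about 4.6 distinguishable levels
# reliability 0.95 -> about 6.1 distinguishable levels
```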

Bottom of the class syndrome

What happens if you consistently punish someone for engaging in a particular practice? Most of us assume that it will get them to stop. Right? So, why have we developed an educational system that punishes millions of children for learning?

Because all capabilities are distributed in the population as “bell curves,” half of all children inevitably will be in the bottom half of their class. Most of these students will consistently receive grades that reflect poor performance relative to other students, primarily because (for a variety of reasons) they learn more slowly than their age mates. Young students tend to understand poor grades as punishments for poor performance or evidence of stupidity. The occasional low grade that is clearly attached to a lack of effort can act as an incentive to try harder. But consistently low grades with no hope of improvement teach students that learning is bad, because no matter how hard they try, it leads to punishment. I call this “bottom of the class syndrome.”

Clearly “learning is bad” is not the message teachers mean to send. They expect the punishment of a low grade to motivate students and their parents to try harder, and this is likely to work for many of the students in the upper half of the class. But it is not likely to work for kids in the bottom half of the class, because most of them are consistently slower learners. Most students in the bottom half of the class need more time and must make more effort to learn concepts and skills than students in the upper half of the class. They can’t afford to dislike learning; even more than students in the top of the class, they must retain their inborn love of learning to have any hope of success over the long term.

As long as we award scores that rank students, about half of those students will be vulnerable to “bottom of the class syndrome.” Society will lose many of them as learners, and their life choices and contributions will be unnecessarily restricted.

My colleagues and I have been working on this problem for many years. The solution we offer requires sophisticated ways of thinking about learning and motivation as well as new tools for evaluating learning—tools that don’t compare students, but allow each of them to develop on their own timeline. And it requires that we systematically study how students learn every single skill or concept we teach, so we know what every score on every assessment really means—in terms of what a given student understands and what he or she is likely to benefit from learning next. It’s hard work, but we’ve learned how to do it. Humans have tackled much more challenging problems. All we need is the will to ensure that every child has an opportunity to realize his or her potential, the patience to do the work, good people to carry the work forward, and a little thing called funding.

The limitations of testing

It is important for those of us who use assessments to ensure that they (1) measure what we say they measure, (2) measure it reliably enough to justify claimed distinctions between and within persons, and (3) are used responsibly. It is relatively easy for testing experts to create assessments that are adequately reliable (2) for individual assessment, and although it is more difficult to show that these tests measure the construct of interest (1), there are reasonable methods for showing that an assessment meets this standard. However, it is more difficult to ensure that assessments are used responsibly (3).

Few consumers of tests are aware of their inherent limitations. Even the best tests, those that are highly reliable and measure what they are supposed to measure, provide only a limited amount of information. This is true of all measures. The more we home in on a measurable dimension—in other words, the greater our precision becomes—the narrower the construct becomes. Time, weight, height, and distance are all extremely narrow constructs. This means that they provide a very specific piece of information extremely well. When we use a ruler, we can have great confidence in the measurement we make, down to very small lengths (depending on the ruler, of course). No one doubts the great advantages of this kind of precision. But we can’t learn anything else about the measured object. Its length usually cannot tell us what the object is, how it is shaped, its color, its use, its weight, how it feels, how attractive it is, or how useful it is. We only know how long it is. To provide an accurate account of the thing that was measured, we need to know many more things about it, and we need to construct a narrative that brings these things together in a meaningful way.

A really good psychological measure is similar. The LAS (Lectical Assessment System), for example, is designed to go to the heart of development, stripping away everything that does not contribute to the pure developmental “height” of a given performance. Without knowledge of many other things—such as the ways of thinking that are generally associated with this “height” in a particular domain, the specific ideas that are associated with this particular performance, information from other performances on other measures, qualitative observations, and good clinical judgment—we cannot construct a terribly useful narrative.

And this brings me to my final point: A formal measure, no matter how great it is, should always be employed by a knowledgeable mentor, clinician, teacher, consultant, or coach as a single item of information about a given client that may or may not provide useful insights into relevant needs or capabilities. Consider this relatively simple example: a given 2-year-old may be tall for his age, but if he is somewhat under weight for his age, the latter measure may seem more important. However, if he has a broken arm, neither measure may loom large—at least until the bone is set. Once the arm is safely in a cast, all three pieces of information—weight, height, and broken arm—may contribute to a clinical diagnosis that would have been difficult to make without any one of them.

It is my hope that the educational community will choose to adopt high standards for measurement, then put measurement in its place—alongside good clinical judgment, reflective life experience, qualitative observations, and honest feedback from trusted others.

What is a holistic assessment?

Thirty years ago, when I was a hippy midwife, the idea of holism began to slip into the counter-culture. A few years later, this much misunderstood notion was all the rage on college campuses. By the time I was in graduate school in the nineties, there was an impassable division between the trendy postmodern holists and the rigidly old-fashioned modernists. You may detect a slight mocking tone, and rightly so. People with good ideas on both sides made themselves look pretty silly by refusing, for example, to use any of the tools associated with the other side. One of the more tragic outcomes of this silliness was the emergence of the holistic assessment.

Simply put, the holistic assessment is a multidimensional assessment that is designed to take a more nuanced, textured, or rich approach to assessment. Great idea. Love it.

It’s the next part that’s silly. Having collected rich information on multiple dimensions, the test designers sum up a person’s performance with a single number. Why is this silly? Because the so-called holistic score becomes pretty much meaningless. Two people with the same score can have very little in common. For example, let’s imagine that a holistic assessment examines emotional maturity, perspective taking, and leadership thinking. Two people receive a score of 10, which may be accompanied by boilerplate descriptions of what emotional maturity, perspective taking, and leadership thinking look like at level 10. However, person one was actually weak in perspective taking and strongest in leadership, and person two was weak in emotional maturity and strongest in perspective taking. The score of 10, it turns out, means something quite different for these two people. I would argue that it is relatively meaningless because there is no way to know, based on the single “holistic” score, how best to support the development of these distinct individuals.
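To make the "same score, different people" problem concrete, here is a small hypothetical sketch; the dimension names echo the example above, but the numbers are invented purely for illustration.

```python
# Invented scores on three dimensions (0-5 each), for illustration only.
person_one = {"emotional maturity": 3, "perspective taking": 2, "leadership": 5}
person_two = {"emotional maturity": 2, "perspective taking": 5, "leadership": 3}

for name, profile in [("person one", person_one), ("person two", person_two)]:
    print(name, "holistic score:", sum(profile.values()), profile)

# Both end up with a "holistic" score of 10, yet the profiles call for very
# different kinds of support, which is exactly the information the single
# number hides.
```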

Holism has its roots in system dynamics, where measurements are used to build rich models of systems. All of the measurements are unidimensional. They are never lumped together into “holistic” measures. That would be equivalent to talking about the temperaturelength of a day or the lengthweight of an object*. It’s essential to measure time, weight, and length with appropriate metrics and then to describe their interrelationships and the outcomes of these interrelationships. The language used to describe these is the language of probability, which is sensitive to differences in the measurement of different properties.

In psychological assessment, dimensionality is a challenging issue. What constitutes a single dimension is a matter for debate. For DTS, the primary consideration is how useful an assessment will be in helping people learn and grow. So, we tend to construct individual assessments, each of which represents a fairly tightly defined content space, and we use only one metric to determine the level of a performance. The meaning of a given score is both universal (it is an order of hierarchical complexity and phase on the skill scale) and contextual (it is assigned to a performance in a particular domain in a particular context, and is associated with particular content). We independently analyze the content of the performance to determine its strengths and weaknesses—relative to its level and the known range of content associated with that level—and provide feedback about these strengths and weaknesses as well as targeted learning suggestions. We use the level score to help us tell a useful story about a particular performance, without claiming to measure “lengthweight”. This is accomplished by the rigorous separation of structure (level) and content.

*If we described objects in terms of their lengthweight, an object that was 10 inches long and 2 lbs could have a lengthweight of 12, but so could an object that was 2 inches long and 10 lbs.

Teacher pay and standardized test results

At the end of October, the Century Foundation released a paper entitled, Eight reasons not to tie teacher pay to standardized test results. I agree with their conclusions, and would add that even if all standardized tests were extremely reliable and measured exactly what they intended to measure, this would be a bad idea. This is because success in the adult world requires a multiplicity of skills and forms of knowledge, and tests focus on only some of these, one at a time. Until we can construct multifaceted longitudinal stories about the progress of individual students that are tied to a non-arbitrary standardized metric, we should not even consider linking student evaluations to teacher pay.