Statistics for all: Estimating confidence

In the first post in this series, I promised to share a quick and dirty trick for determining how much confidence you can have in a test score. I will. But first, I want to show you a bit more about what estimating confidence means when it comes to educational and psychological tests.

Let’s start with a look at how test scores are usually reported. The figure below shows three scores, one at level 8, one at level 6, and one at level 4. Looking at this figure, most of us would be inclined to assume that these scores are what they seem to be—precise indicators of the level of a trait or skill.

[Figure: How test scores are usually presented]

But this is not the case. Test scores are fuzzy. They’re best understood as ranges rather than as points on a ruler. In other words, test scores are always surrounded by confidence intervals. A person’s true score is likely to fall somewhere in the range described by the confidence interval around a test score.

In order to figure out how fuzzy a test score actually is, you need one thing—an indicator of statistical reliability. Most of the time, this is something called Cronbach’s Alpha. All good test developers publish information about the statistical reliability of their measures, ideally in refereed academic journals with easy-to-find links on their web sites! If a test developer won’t provide you with information about Alpha (or its equivalent) for each score reported on a test, it’s best to move on.

The higher the reliability (usually Alpha), the smaller the confidence interval. And the smaller the confidence interval, the more confidence you can have in a test score.

The table below will help to clarify why it is important to know Alpha (or its equivalent). It shows the relationship between Alpha (which can range from 0 to 1.0) and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.

Reliability    Strata
.70            2
.80            3
.90            4
.94            5
.95            6
.96            7
.97            8
.98            9
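
For readers who want values between the rows of the table, here is a minimal sketch of one published way to turn a reliability coefficient into strata: the separation formulas described by Wright (1996) in the reference list. It is an assumption that the table was built along these lines. The function names are hypothetical, and the results only approximate the table (and the strata quoted later in this article), which appear to be rounded differently.

```python
import math


def separation(reliability: float) -> float:
    """Separation index G = sqrt(R / (1 - R)) for a reliability R in [0, 1)."""
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be at least 0 and less than 1")
    return math.sqrt(reliability / (1 - reliability))


def strata(reliability: float) -> float:
    """Approximate number of distinct levels, H = (4G + 1) / 3."""
    return (4 * separation(reliability) + 1) / 3


# Print unrounded strata for a few reliabilities from the table.
for r in (0.70, 0.80, 0.90, 0.95):
    print(f"reliability {r:.2f}: about {strata(r):.1f} strata")
```

The function returns unrounded values on purpose, so you can decide how conservatively to round them.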

Strata have direct implications for the confidence we can have in a person’s score on a given assessment, because they tell us about the range within which a person’s true score would fall—its confidence interval—given the score awarded.

Imagine that you have just taken a test of emotional intelligence with a score range of 1 to 10 and a reliability of .95. The number of strata into which an assessment with a reliability of .95 can be divided is about 6, which means that each stratum equals about 1.75 points on the 10-point scale (roughly 10 divided by 6). If your score on this test was 8, your true score would likely be somewhere between 7.13 and 8.88—your score’s confidence interval.
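
The arithmetic in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a formal psychometric procedure: the function name and parameters are hypothetical, the range is taken to be the score plus or minus half a stratum, and the result is kept inside the 1 to 10 scale. Because it works with unrounded stratum widths, its output matches the figures in this article only approximately.

```python
def true_score_range(score, strata, scale_min=1.0, scale_max=10.0, scale_points=10.0):
    """Return (low, high): the score plus or minus half a stratum, kept on the scale."""
    if strata <= 0:
        raise ValueError("strata must be positive")
    stratum_width = scale_points / strata  # e.g. 10 points divided by 6 strata
    half = stratum_width / 2
    low = max(scale_min, score - half)    # do not go below the bottom of the scale
    high = min(scale_max, score + half)   # do not go above the top of the scale
    return low, high


low, high = true_score_range(score=8, strata=6)
print(f"A score of 8 with 6 strata: {low:.2f} to {high:.2f}")
```

Passing a smaller number of strata (for a less reliable test) widens the range, which is the point the rest of this article illustrates.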

The figure below shows the true score ranges for three test takers, CB, RM, and PR. The fact that these ranges don’t overlap gives us confidence that the emotional intelligence of these test-takers is actually different**.

If these scores were closer together, their confidence intervals would overlap. And if that were the case—for example if you were comparing two individuals with scores of 8 and 8.5—it would not be correct to say the scores were different from one another. In fact, it would be incorrect for a hiring manager to consider the difference between a score of 8 and a score of 8.5 in making a choice between two job candidates.

By the way, tests with Alphas in the range of .94 or higher are considered suitable for high-stakes use (assuming that they meet other essential validity requirements). What you see in the figure below is about as good as it gets in educational and psychological assessment.

[Figure: Estimating confidence when Alpha is .95]

Most assessments used in organizations do not have Alphas that are anywhere near .95. Some of the better assessments have Alphas as high as .85. Let’s take a look at what an Alpha at this level does to confidence intervals.

If the test you have taken has a score range of 1–10 and an Alpha (reliability) of .85, the number of strata into which this assessment can be divided is about 3.4, which means that each stratum equals about 2.9 (10 divided by 3.4) points on the 10-point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6.6 to 9.5*.

In the figure below, note that CB’s true score range now overlaps RM’s true score range and RM’s true score range overlaps PR’s true score range. This means we cannot say—with confidence—that CB’s score is different from RM’s score, or that RM’s score is different from PR’s score.
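
The overlap check described here is easy to automate. The sketch below assumes, for illustration only, that CB, RM, and PR scored 8, 6, and 4 (the three scores mentioned at the start of this article), and that each true score range is the score plus or minus half a stratum. The helper names are hypothetical.

```python
def score_range(score, half_width):
    """Return the (low, high) range around a score."""
    return (score - half_width, score + half_width)


def ranges_overlap(a, b):
    """True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


half_width = (10 / 3.4) / 2  # half a stratum when Alpha is .85 (about 3.4 strata)
scores = {"CB": 8, "RM": 6, "PR": 4}  # assumed scores, for illustration only
ranges = {name: score_range(s, half_width) for name, s in scores.items()}

for first, second in (("CB", "RM"), ("RM", "PR"), ("CB", "PR")):
    verdict = "overlap" if ranges_overlap(ranges[first], ranges[second]) else "do not overlap"
    print(f"{first} and {second}: ranges {verdict}")
```

Under these assumptions, only the CB and PR ranges stay apart, which is why neither neighboring pair can be told apart with confidence.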

Assessments with Alphas in the .85 range are suitable for classroom use or low-stakes contexts. Yet, every day, schools and businesses use tests with reliabilities in the .85 range to make high-stakes decisions—such as who will be selected for advancement or promotion. And this is often done in a way that would exclude RM (yellow circle) even though his confidence interval overlaps CB’s (teal circle) confidence interval.

[Figure: Estimating confidence when Alpha is .85]

Many tests used in organizations have Alphas in the .75 range. If the test you have taken has a score range of 1–10 and an Alpha of .75, the number of strata into which this assessment can be divided is about 2.2, which means that each stratum equals about 4.5 points on the 10-point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6–10*.

As shown in the figure below, scores would now have to differ by at least 4.5 points in order for us to distinguish between two people. CB’s and PR’s scores are different, but RM’s score is uninterpretable.

Tests or sub-scales with Alphas in the .75 range are considered suitable for research purposes. Yet, sad to say, schools and businesses now use tests with scales or sub-scales that have Alphas in or below the .75 range, treating these scores as if they provide useful information, when in most cases the scores—like RM’s—are uninterpretable.

[Figure: Estimating confidence when Alpha is .75]

If your current test providers are not reporting true score ranges (confidence intervals), ask for them. If they only provide Alphas (reliability statistics) you can use the table and figures in this article to calculate true score ranges for yourself. If you don’t want to do the math, no problem. You can use the figures above to get a feel for how precise a score is.

Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important. I’ll be sharing some tricks for looking at these forms of validity in future articles.

Related Articles

Statistics for all: What the heck is confidence?


*This range will be wider at the top and bottom of the scoring range and a bit narrower in the middle of the range.

**It doesn’t tell us if emotional intelligence is important. That is determined in other ways.


References

Guilford J. P. (1965). Fundamental statistics in psychology and education. 4th Edn. New York: McGraw-Hill.

Kubiszyn T., Borich G. (1993). Educational testing and measurement. New York: Harper Collins.

Wright B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.

 


World Economic Forum—tomorrow’s skills

The top 10 workplace skills of the future.

Sources: Future of Jobs Report, WEF 2017

In a recent blog post—actually in several recent blog posts—I've been emphasizing the importance of building tomorrow's skills. These are the kinds of skills we all need to navigate our increasingly complex and changing world. While I may not agree that all of the top 10 skills listed in the World Economic Forum report (shown above) belong in a list of skills (creativity is much more than a skill, and service orientation is more of a disposition than a skill), the flavor of this list is generally in sync with the kinds of skills, dispositions, and behaviors required in a complex and rapidly changing world.

The "skills" in this list cannot be…

  • developed in learning environments focused primarily on correctness or in workplace environments that don't allow for mistakes; or
  • measured with ratings on surveys or on tests of people's ability to provide correct answers.

These "skills" are best developed through cycles of goal setting, information gathering, application, and reflection—what we call virtuous cycles of learning—or VCoLs. And they're best assessed with tests that focus on applications of skill in real-world contexts, like Lectical Assessments, which are based on a rich research tradition focused on the development of understanding and skill.

 


Learning how to learn or learning how to pass tests?

I've been auditing a very popular 4.5-star Coursera course called "Learning how to learn." It uses all of the latest research to help people improve their learning skills. Yet, even though the lectures in the course are interesting and the research behind the course appears to be sound, I find it difficult to agree that it is a course that helps people learn how to learn.

First, the tests used to determine how well participants have built the learning skills described in this course are actually tests of how well they have learned vocabulary and definitions. As far as I can tell, no skills are involved other than the ability to recall course content. This is problematic. The assumption that learning vocabulary and definitions builds skill is unwarranted. I believe we all know this. Who has not had the experience of learning something well enough to pass a test only to forget most of what they had learned shortly thereafter?

Second, the content in tests at the end of the videos isn't particularly relevant to the stated intention of the course. These tests require remembering (or scrolling back to) facts like "Many new synapses are formed on dendrites." We do not need to learn this to become effective learners. The test item for which this is the correct answer is focused on an aspect of how learning works rather than how to learn. And although understanding how learning works might be a step toward learning how to learn, answering this question correctly doesn't tell us how the participant understands anything at all.

Third, if the course developers had used tests of skill (tests that asked participants to show how effectively they could apply the described techniques), we would be able to ask about the extent to which the course helps participants learn how to learn. Instead, the only way we have to evaluate the effectiveness of the course is through participant ratings and comments—how much people like it. I'm not suggesting that liking a course is unimportant, but it's not a good way to evaluate its effectiveness.

Fourth, the course seems to be primarily concerned with fostering a kind of learning that helps people do better on tests of correctness. The underlying and unstated assumption seems to be that if you can do better on these tests, you have learned better. This assumption flies in the face of several decades of educational research, including our own [for example, 1, 2, 3]. Correctness is not adequate evidence of understanding or real-world skill. If we want to know how well people understand new knowledge, we must observe how they apply this knowledge in real-world contexts. If we want to evaluate their level of skill, we must observe how well they apply the skill in real-world contexts. In other words, a course in learning how to learn—especially a course in learning how to learn—should be building useable skills that have value beyond the act of passing a test of correctness.

Fifth, the research behind this course can help us understand how learning works. At Lectica, we've used the very same information as part of the basis for our learning model, VCoL+7. But instead of using this knowledge to support the status quo—an educational system that privileges correctness over understanding and skill—we're using it to build learning tools designed to ensure that learning in school goes beyond correctness to build deep understanding and robust skill.

For the vast majority of people, schooling is not an end in itself. It is preparation for life—preparation with tomorrow's skills. It's time we held our educational institutions accountable for ensuring that students know how to learn more than correct answers. Wherever their lives take them, they will do better if equipped with understanding and skill. Correctness is not enough.

 


[1] FairTest; Mulholland, Quinn  (2015). The case against standardized testing. Harvard Political Review, May 14.

[2] Schwartz, M. S., Sadler, P. M., Sonnert, G. & Tai, R. H. (2009). Depth versus breadth: How content coverage in high school science courses relates to later success in college science coursework. Science Education, 93, 5, 798-826.

[3] Kontra, C., Goldin-Meadow, S., & Beilock, S. L. (2012). Embodied learning across the lifespan. Topics in Cognitive Science, 4, 4, 731–739.

 


Lectica’s story: long, rewarding, & still unfolding


Lectica's story started in Toronto in 1976…

Identifying the problem

During the 70s and 80s I practiced midwifery. It was a great honor to be present at the births of over 500 babies, and in many cases, follow them into childhood. Every single one of those babies was a joyful, driven, and effective "every moment" learner. Regardless of difficulty and pain they all learned to walk, talk, interact with others, and manipulate many aspects of their environment. They needed few external rewards to build these skills—the excitement and suspense of striving seemed to be reward enough. I felt like I was observing the "life force" in action.

Unfortunately, as many of these children approached the third grade (age 8), I noticed something else—something deeply troubling. Many of the same children seemed to have lost much of this intrinsic drive to learn. For them, learning had become a chore motivated primarily by extrinsic rewards and punishments. Because this was happening primarily to children attending conventional schools (children receiving alternative instruction seemed to be exempt), it appeared that something about schooling was depriving many children of the fundamental human drive required to support a lifetime of learning and development—a drive that looked to me like a key source of happiness and fulfillment.

Understanding the problem

Following my midwifery career, I flirted briefly with a career in advertising, but by the early 90s I was back in school—in a Ph.D. program in U.C. Berkeley's Graduate School of Education—where I found myself observing the same pattern I'd observed as a midwife. Both the research and my own lab experience exposed the early loss of students' natural love of learning. My concern was only increased by the newly emerging trend toward high-stakes multiple-choice testing, which my colleagues and I saw as a further threat to children's natural drive to learn.

Most of the people I've spoken to about this problem have agreed that it's a shame, but few have seen it as a problem that can be solved, and many have seen it as an inevitable consequence of either mass schooling or simple maturation. But I knew it was not inevitable. Children educated in a range of alternative environments did not appear to lose their drive to learn. Additionally, above-average students in conventional schools appeared to be more likely to retain their love of learning.

I set out to find out why—and ended up on a long journey toward a solution.

How learning works

First, I needed to understand how learning works. At Berkeley, I studied a wide variety of learning theories in several disciplines, including developmental theories, behavioral theories, and brain-based theories. I collected a large database of longitudinal interviews and submitted them to in-depth analysis, looked closely at the relation between testing and learning, and studied psychological measurement, all in the interest of finding a way to support children's growth while reinforcing their love of learning.

My dissertation—which won awards from both U.C. Berkeley and the American Psychological Association—focused on the development of people's conceptions of learning from age 5 through 85, and how this kind of knowledge could be used to measure and support learning. In 1998, I received $500,000 from the Spencer Foundation to further develop the methods designed for this research. Some of my areas of expertise are human learning and development, psychometrics, metacognition, moral education, and research methods.

In the simplest possible terms, what I learned in 5 years of graduate school is that the human brain is designed to drive learning, and that preserving that natural drive requires 5 ingredients:

  1. a safe environment that is rich in learning opportunities and healthy human interaction,
  2. a teacher who understands each child's interests and level of tolerance for failure,
  3. a mechanism for determining "what comes next"—what is just challenging enough to allow for success most of the time (but not all of the time),
  4. instant actionable feedback, and 
  5. the opportunity to integrate new knowledge or skills into each learner's existing knowledge network well enough to make it useable before pushing instruction to the next level. (We call this building a "robust knowledge network"—the essential foundation for future learning.)*

Identifying the solution

Once we understood what learning should look like, we needed to decide where to intervene. The answer, when it came, was a complete surprise. Understanding what comes next—something that can only be learned by measuring what a student understands now—was an integral part of the recipe for learning. This meant that testing—which we originally saw as an obstacle to robust learning—was actually the solution—but only if we could build tests that would free students to learn the way their brains are designed to learn. These tests would have to help teachers determine "what comes next" (ingredient 3) and provide instant actionable feedback (ingredient 4), while rewarding them for helping students build robust knowledge networks (ingredient 5).

Unfortunately, conventional standardized tests were focused on "correctness" rather than robust learning, and none of them were based on the study of how targeted concepts and skills develop over time. Moreover, they were designed not to support learning, but rather to make decisions about advancement or placement, based on how many correct answers students were able to provide relative to other students. Because this form of testing did not meet the requirements of our learning recipe, we'd have to start from scratch.

Developing the solution

We knew that our solution—reinventing educational testing to serve robust learning—would require many years of research. In fact, we would be committing to possible decades of effort without a guaranteed result. It was the vision of a future educational system in which all children retained their inborn drive for learning that ultimately compelled us to move forward. 

To reinvent educational testing, we needed to:

  1. make a deep study of precisely how children build particular knowledge and skills over time in a wide range of subject areas (so these tests could accurately identify "what comes next");
  2. make tests that determine how deeply students understand what they have learned—how well they can use it to address real-world issues or problems (requires that students show how they are thinking, not just what they know—which means written responses with explanations); and
  3. produce formative feedback and resources designed to foster "robust learning" (build robust knowledge networks).

Here's what we had to invent:

  1. A learning ruler (building on Commons [1998] and Fischer [2006]);
  2. A method for studying how students learn tested concepts and skills (refining the methods developed for my dissertation);
  3. A human scoring system for determining the level of understanding exhibited in students' written explanations (building upon Commons' and Fischer's methods, refining them until measurements were precise enough for use in educational contexts); and 
  4. An electronic scoring system, so feedback and resources could be delivered in real time.

It took over 20 years (1996–2016), but we did it! And while we were doing it, we conducted research. In fact, our assessments have been used in dozens of research projects, including a $25 million study of literacy conducted at Harvard, and numerous Ph.D. dissertations—with more on the way.

What we've learned

We've learned many things from this research. Here are some that took us by surprise:

  1. Schools that focus on building deep understanding graduate seniors who are up to 5 years ahead (on our learning ruler) of seniors from schools that focus on correctness (2.5 to 3 years after taking socioeconomic status into account).
  2. Students in schools that foster robust learning develop faster and continue to develop longer (into adulthood) than students in schools that focus on correctness.
  3. On average, students in schools that foster robust learning produce more coherent and persuasive arguments than students in schools that focus on correctness.
  4. On average, students in our inner-city schools, which are the schools most focused on correctness, stop developing (on our learning ruler) in grade 10. 
  5. The average student who graduates from a school that strongly focuses on correctness is likely, in adulthood, to (1) be unable to grasp the complexity and ambiguity of many common situations and problems, (2) lack the mental agility to adapt to changes in society and the workplace, and (3) dislike learning. 

From our perspective, these results point to an educational crisis that can best be addressed by allowing students to learn as their brains were designed to learn. Practically speaking, this means providing learners, parents, teachers, and schools with metrics that reward and support teaching that fosters robust learning. 

Where we are today

Lectica has created the only metrics that meet all of these requirements. Our mission is to foster greater individual happiness and fulfillment while preparing students to meet 21st century challenges. We do this by creating and delivering learning tools that encourage students to learn the way their brains were designed to learn. And we ensure that students who need our learning tools the most get them first by providing free subscriptions to individual teachers everywhere.

To realize our mission, we organized as a nonprofit. We knew this choice would slow our progress (relative to organizing as a for-profit and welcoming investors), but it was the only way to guarantee that our true mission would not be derailed by other interests.

Thus far, we've funded ourselves with work in the for-profit sector and income from grants. Our background research is rich, our methods are well-established, and our technology works even better than we thought it would. Last fall, we completed a demonstration of our electronic scoring system, CLAS, a novel technology that learns from every single assessment taken in our system. 

The groundwork has been laid, and we're ready to scale. All we need is the platform that will deliver the assessments (called DiscoTests), several of which are already in production.

After 20 years of high-stakes testing, students and teachers need our solution more than ever. We feel compelled to scale as quickly as possible, so we can begin the process of reinvigorating today's students' natural love of learning, and ensure that the next generation of students never loses theirs. Lectica's story isn't finished. Instead, we find ourselves on the cusp of a new beginning!

Please consider making a donation today.

 


A final note: There are many benefits associated with our approach to assessment that were not mentioned here. For example, because the assessment scores are all calibrated to the same learning ruler, students, teachers, and parents can easily track student growth. Even better, our assessments are designed to be taken frequently and to be embedded in low-stakes contexts. For grading purposes, teachers are encouraged to focus on growth over time rather than specific test scores. This way of using assessments pretty much eliminates concerns about cheating. And finally, the electronic scoring system we developed is backed by the world's first "taxonomy of learning," which also serves many other educational and research functions. It's already spawned a developmentally sensitive spell-checker! One day, this taxonomy of learning will be robust enough to empower teachers to create their own formative assessments on the fly. 

 


*This is the ingredient that's missing from current adaptive learning technologies.

 


Proficiency vs. growth

We've been hearing quite a bit about the "proficiency vs. growth" debate since Betsy DeVos (Trump's candidate for Education Secretary) was asked to weigh in last week. This debate involves a disagreement about how high-stakes tests should be used to evaluate educational programs. Advocates for proficiency want to reward schools when their students score higher on state tests. Advocates for growth want to reward schools when their students grow more on state tests. Readers who know about Lectica's work can guess where we'd land in this debate—we're outspokenly growth-minded.

For us, however, the proficiency vs. growth debate is only a tiny piece of a broader issue about what counts as learning. Here's a sketch of the situation as we see it:

Getting a higher score on a state test means that you can get more correct answers on increasingly difficult questions, or that you can more accurately apply writing conventions or decode texts. But these aren't the things we really want to measure. They're "proxies"—approximations of our real learning objectives. Test developers measure proxies because they don't know how to measure what we really want to know.

What we really want to know is how well we're preparing students with the skills and knowledge they'll need to successfully navigate life and work.

Scores on conventional tests predict how well students are likely to perform, in the future, on conventional tests. But scores on these tests have not been shown to be good predictors of success in life.*  

In light of this glaring problem with conventional tests, the debate between proficiency and growth is a bit of a red herring. What we really need to be asking ourselves is a far more fundamental question:

What knowledge and skills will our children need to navigate the world of tomorrow, and how can we best nurture their development?

That's the question that frames our work here at Lectica.

 

*For information about the many problems with conventional tests, see FairTest.

 


Straw men and flawed metrics

Ten years ago, Kirschner, Sweller, & Clark published an article entitled "Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching."

In this article, Kirschner and his colleagues contrast outcomes for what they call "guidance instruction" (lecture and demonstration) with those from constructivism-based instruction. They conclude that constructivist approaches produce inferior outcomes.

The article suffers from at least three serious flaws

First, the authors, in making their distinction between guided instruction and constructivist approaches, have created a caricature of constructivist approaches. Very few experienced practitioners of constructivist, discovery, problem-based, experiential, or inquiry-based teaching would characterize their approach as minimally guided. "Differently guided" would be a more appropriate term. Moreover, most educators who use constructivist approaches include lecture and demonstration where these are appropriate.

Second, the research reviewed by the authors was fundamentally flawed. For the most part, the metrics employed to evaluate different styles of instruction were not reasonable measures of the kind of learning constructivist instruction aims to support—deep understanding (the ability to apply knowledge effectively in real-world contexts). They were measures of memory or attitude. Back in 2010, Stein, Fischer, and I argued that metrics can't produce valid results if they don't actually measure what we care about (Redesigning testing: Operationalizing the new science of learning. Why isn't this a no-brainer?).

And finally, the longitudinal studies Kirschner and his colleagues reviewed had short time-spans. None of them examined the long-term impacts of different forms of instruction on deep understanding or long-term development. This is a big problem for learning research—one that is often acknowledged, but rarely addressed.

Since Kirschner's article was published in 2006, we've had an opportunity to examine the difference between schools that provide different kinds of instruction, using assessments that measure the depth and coherence of students' understanding. We've documented a 3 to 5 year advantage, by grade 12, for students who attend schools that emphasize constructivist methods vs. those that use more "guidance instruction".

To learn more, see:

Are our children learning robustly?

Lectica rationale

 


What every buyer should know about forms of assessment

In this post, I'll be describing and comparing three basic forms of assessment—surveys, tests of factual and procedural knowledge, and performative tests.

Surveys—measures of perception, preference, or opinion

What is a survey? A survey (a.k.a. inventory) is any assessment that asks the test-taker to choose from a set of options, such as "strongly agree" or "strongly disagree", based on opinion, preference, or perception. Surveys can be used by organizations in several ways. For example, opinion surveys can help maintain employee satisfaction by providing a "safe" way to express dissatisfaction before workplace problems have a chance to escalate.

Surveys have been used by organizations in a variety of ways. Just about everyone who's worked for a large organization has completed a personality inventory as part of a team-building exercise. The results stimulate lots of water cooler discussions about which "type" or "color" employees are, but their impact on employee performance is unclear. (Fair warning: I'm notorious for my discomfort with typologies!) Some personality inventories are even used in high-stakes hiring and promotion decisions, a practice that continues despite evidence that they are very poor predictors of employee success [1].

Although most survey developers don't pretend their assessments measure competence, many do. The item on the left was used in a survey with the words "management skills" in its title.

Claims that surveys measure competence are most common when "malleable traits"—traits that are subject to change, learning or growth—are targeted. One example of a malleable trait is "EQ" or "emotional intelligence". EQ is viewed as a skill that can be developed, and there are several surveys that purport to measure its development. What they actually measure is attitude.

Another example of surveys masquerading as assessments of skill is in the measurement of "transformational learning". Transformational learning is defined as a learning experience that fundamentally changes the way a person understands something, yet the only way it appears to be measured is with surveys. Transformational learning surveys measure people's perceptions of their learning experience, not how much they are actually changed by it.

The only survey-type assessments that can be said to measure something like skill are those, such as 360s, that ask people about their perceptions of another person's behavior. Although 360s inadvertently measure other things, like how much a person is liked or whether a respondent agrees with that person, they may also document evidence of behavior change. If behavior change is what you are interested in, a 360 may be appropriate in some cases. Keep in mind, though, that while a 360 may measure change in a target's behavior, it is also likely to measure changes in a respondent's attitude that are unrelated to the target's behavior.

360-type assessments may, to some extent, serve as tests of competence, because behavior change may be an indication that someone has learned new skills. When an assessment measures something that might be an indicator of something else, it is said to measure a proxy. A good 360 may measure a proxy (perceptions of behavior) for a skill (competence).

There are literally hundreds of research articles that document the limitations of surveys, but I'll mention only one more limitation here: all of the survey types I've discussed are vulnerable to "gaming" (smart people can easily figure out what the most desirable answers are).

Surveys are extremely popular today because, relative to assessments of skill, they are inexpensive to develop and cost almost nothing to administer. Lectica gives away several high quality surveys for free because they are so inexpensive, yet organizations spend millions of dollars every year on surveys, many of which are falsely marketed as assessments of skill or competence.

Tests of factual and procedural knowledge

A test of competence is any test that asks the test taker to demonstrate a skill. Tests of factual and procedural knowledge can legitimately be thought of as tests of competence.

The classic multiple choice test examines factual knowledge, procedural knowledge, and basic comprehension. If you want to know if someone knows the rules, which formulas to apply, the steps in a process, or the vocabulary of a field, a multiple choice test may meet your needs. Often, the developers of multiple choice tests claim that their assessments measure understanding, reasoning, or critical thinking. This is because some multiple choice tests measure skills that are assumed to be proxies for skills like understanding, reasoning, and critical thinking. They are not direct tests of these skills.

Multiple choice tests are widely used, because there is a large industry devoted to making them, but they are increasingly unpopular because of their (mis)use as high stakes assessments. They are often perceived as threatening and unfair because they are often used to rank or select people, and are not helpful to the individual learner. Moreover, their relevance is often brought into question because they don't directly measure what we really care about—the ability to apply knowledge and skills in real-life contexts.

Performative tests

Tests that ask people to directly demonstrate their skills (1) in the real world, (2) in real-world simulations, or (3) as applied to real-world scenarios are called performative tests. These tests usually do not have "right" answers. Instead, they employ objective criteria to evaluate performances for the level of skill demonstrated, and often play a formative role by providing feedback designed to improve performance or understanding. This is the kind of assessment you want if what you care about is deep understanding, reasoning skills, or performance in real-world contexts.

Performative tests are the most difficult tests to make, but they are the gold standard if what you want to know is the level of competence a person is likely to demonstrate in real-world conditions—and if you're interested in supporting development. Standardized performative tests are not yet widely used, because the methods and technology required to develop them are relatively new, and there is not yet a large industry devoted to making them. But they are increasingly popular because they support learning.

Unfortunately, performative tests may initially be perceived as threatening because people's attitudes toward tests of knowledge and skill have been shaped by their exposure to high stakes multiple choice tests. The idea of testing for learning is taking hold, but changing the way people think about something as ubiquitous as testing is an ongoing challenge.

Lectical Assessments

Lectical Assessments are performative tests—tests for learning. They are designed to support robust learning—the kind of learning that optimizes the growth of essential real-world skills. We're the leader of the pack when it comes to the sophistication of our methods and technology, our evidence base, and the sheer number of assessments we've developed.

[1] Frederick P. Morgeson, et al. (2007) Are we getting fooled again? Coming to terms with limitations in the use of personality tests for personnel selection, Personnel Psychology, 60, 1029-1033.


Lectical (CLAS) scores are subject to change

We incorporate feedback loops called virtuous cycles in everything we do. And I mean everything. Our governance structure is fundamentally iterative. (We're a Sociocracy.) Our project management approach is iterative. (We use Scrum.) We develop ideas iteratively. (We use Design Thinking.) We build our learning tools iteratively. (We use developmental maieutics.) And our learning model is iterative. (We use the virtuous cycle of learning.) One important reason for using all of these iterative processes is that we want every activity in our organization to reward learning. Conveniently, all of the virtuous cycles we iterate through do double duty as virtuous cycles of learning.

All of this virtuous cycling has an interesting (and unprecedented) side effect. The score you receive on one of our assessments is subject to change. Yes, because we learn from every single assessment taken in our system, what we learn could cause your score on any assessment you take here to change. Now, it's unlikely to change very much, probably not enough to affect the feedback you receive, but the fact that scores change from time to time can really shake people up. Some people might even think we've lost the plot!

But there is method in our madness. Allowing your score to fluctuate a bit as our knowledge base grows is our way of reminding everyone that there's uncertainty in any test score, and ourselves that there's always more to learn about how learning works. 


The dark? side of Lectical Assessment

Recently, members of our team at Lectica have been discussing potential misuses of Lectical Assessments, and exploring the possibility that they could harm some students. There are serious concerns that require careful consideration and discussion, and I urge readers to pitch in.

One of the potential problems we've discussed is the possibility that students will compare their scores with one another, and that students with lower scores will suffer from these comparisons. Here's my current take on this issue.

Students receive scores all the time. By third grade they already know their position in the class hierarchy, and live every day with that reality. Moreover, despite the popular notion that all students can become above average if they work hard enough, average students don't often become above average students, which means that during their entire 12 years of schooling, they rarely receive top rewards (the best grades) for the hard work they do. In fact, they often feel like they're being punished even when they try their best. To make things worse, in our current system they're further punished by being forced to memorize content they haven't been prepared to understand, a problem that worsens year by year.

Lectica's approach to assessment can't prevent students from figuring out where their scores land in the class distribution, but we can give all students an opportunity to see themselves as successful learners, no matter where their scores are in that distribution. Average or below average students may still have to live with the reality that they grow at different rates than some of their peers, but they'll be rewarded for their efforts, just the same.

I've been told by some very good teachers that it is unacceptable to use the expression "average student." While I share the instinct to protect students from the harm that can come from labels, I don't share the belief that being an average student is a bad thing. Most of us were average students. To be more precise, assuming scores are roughly normally distributed, about 68% of us were within one standard deviation of the mean. How did being a member of the majority become a bad thing? And what harm are we doing to students by creating the illusion that we are all capable of performing above the mean?
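The 68% figure comes from the normal (bell) curve. Here is a minimal sketch, assuming scores are normally distributed, which real classroom data only approximate. The function name `share_within` is hypothetical, chosen for illustration.

```python
import math

def share_within(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean.

    Assumes a perfectly normal distribution (an idealization).
    """
    return math.erf(k / math.sqrt(2))

# Share of students within one standard deviation of the mean.
print(f"{share_within(1):.1%}")  # 68.3%
```

Under the same assumption, exactly half of any group sits above the mean, which is why "everyone above average" cannot happen no matter how high the average rises.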

I don't think we hurt children by serving up reality. We hurt them when we mislead them by telling them they can all be above average, or when we make them feel hopeless by insisting that they all learn at the same pace, then punishing them when they can't keep up.

I'm not saying it's not possible to raise the average. We do it by meeting the specific learning needs of every student and making sure that learning time is spent learning robustly. But we can't change the fact that there's a distribution. And we shouldn't pretend otherwise.

Lectical Assessments are tests, and are subject to the same abuses as other tests. But they have three attributes that help mitigate these abuses. First, they allow all students without severe disabilities to see themselves as learners. Second, they help teachers customize instruction to meet the needs of each student, so more kids have a chance to achieve their full potential. And finally, they reward good pedagogy—even in cases in which the assessments are being misused. After all, testing drives instruction.


Comparison of DiscoTests with conventional tests

DiscoTests and conventional standardized tests can be thought of as complementary. They are designed to test different kinds of skills, and research confirms that they are successful in doing so. Correlations between scores on the kind of developmental assessments made by DTS and scores on conventional multiple choice assessments are in the .40-.60 range. Because the variance two measures share is the square of their correlation, somewhere between 16% and 36% of the kind of learning that is captured by conventional assessments is likely to overlap with the kind of learning that is captured by DiscoTests.
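The step from a correlation to a percentage of overlap is just squaring. A minimal sketch of that arithmetic follows. The function name `shared_variance` is hypothetical, and the calculation assumes the usual reading of a squared correlation as the proportion of variance shared.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two sets of scores share, given their correlation r."""
    return r ** 2

# The two ends of the correlation range quoted above.
for r in (0.40, 0.60):
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r):.0%}")
```

Read the other way around, even at the top of the range, well over half of the variation in one kind of score is not accounted for by the other.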

The table below provides a comparison of DiscoTests with conventional standardized tests on a number of dimensions.

| Category | DiscoTests | Conventional tests |
|---|---|---|
| Theoretical foundation | Cognitive developmental theory, Dynamic Skill Theory, Test theory | Test theory |
| Scale | Fischer’s Dynamic Skill Scale, an exhaustively researched general developmental scale, which is a member of a family of similar scales that were developed during the 20th century. | Statistically generated scales, different for each test (though some tests are statistically linked) |
| Learning sequences | Empirical, fine-grained & precise, calibrated to the dynamic skill scale | Empirical, coarse-grained and general |
| Primary item type | Open response | More or less sophisticated forms of multiple choice |
| Targeted skills | Reasoning with knowledge, knowledge application, making connections between new and existing knowledge, writing | Content knowledge, procedural knowledge |
| Content | Carefully selected “big ideas” and the concepts and skills associated with them. | The full range of content specified in state standards for a given subject |
| Educative/formative | Yes, (1) each DiscoTest focuses on ideas and skills central to K-12 curricula, (2) test questions require students to thoughtfully apply new knowledge and connect it with their existing knowledge, (3) students receive reports with targeted feedback and learning suggestions, (4) teachers learn how student knowledge develops in general and on each targeted concept or skill. | Not really, though increasingly claim to be |
| Embeddable in curricula | Yes, DiscoTests are designed to be part of the curriculum. | No |
| Standardized | Yes, statistically, calibrated to the skill scale | Yes, statistically only |
| Stakes | Low. Selection decisions are based on performance patterns over time on many individual assessments. | High. Selection decisions are often based on single assessments. |
| Ecological validity | Direct tests that focus on deepening and connecting knowledge about key concepts and ideas, while developing broad skills that are required in adult life, such as those required for reasoning, communicating, and problem-solving. | Tests of proxies, focus on ability to detect correct answers. |
| Statistical reliability | .91+ for a single age cohort (distinguishes 5-6 distinct levels of performance). | For high stakes tests, usually .95+ for a single age cohort (distinguishes 6-7 distinct levels of performance). |
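The reliability figures in the comparison above are paired with a number of distinct levels of performance. The sketch below shows one common way to turn a reliability coefficient into a count of levels (strata), using the separation formula from the Rasch measurement literature. It is an assumption that this formula is relevant here. The source does not say how its level counts were derived, and the function names are hypothetical.

```python
import math

def separation(reliability: float) -> float:
    """Separation index G: ratio of true score spread to measurement error."""
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be at least 0 and less than 1")
    return math.sqrt(reliability / (1 - reliability))

def strata(reliability: float) -> float:
    """Approximate number of statistically distinct levels, (4G + 1) / 3.

    Assumed formula; not confirmed as the one behind the figures quoted above.
    """
    return (4 * separation(reliability) + 1) / 3

for r in (0.80, 0.91, 0.95):
    print(f"reliability {r:.2f} -> about {strata(r):.1f} strata")
```

A reliability of .80 works out to about 3 strata under this formula. Results for other values should be treated as rough guides, and they may not reproduce the published figures exactly, since ".91+" and ".95+" describe reliabilities at or above those values.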