Flesch-Kincaid and other reading level metrics are sometimes employed to compare the arguments made by politicians in their speeches, interviews, and writings. What are these metrics, and what do they actually tell us about these verbal performances?
Flesch-Kincaid examines sentence length, word length, and syllable count. Texts are considered “harder” when they have longer sentences and use words with more letters, and “easier” when they have shorter sentences and use words with fewer letters. For decades, Flesch-Kincaid and other reading level metrics have been built into word processors. When a grammar checker advises you that the reading level of your article is too high, that warning is most likely based on word and sentence length.
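To make the mechanics concrete, here is a minimal sketch of how a Flesch-Kincaid-style grade level can be computed. The grade-level formula is the published one; the syllable counter is a rough vowel-group heuristic assumed for illustration, not the counting method any particular word processor uses.

```python
import re

def count_syllables(word):
    # Rough heuristic: count runs of vowels. Real tools use pronunciation
    # dictionaries or more careful rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Published Flesch-Kincaid grade-level formula: longer sentences and
    # more syllables per word push the estimated grade up.
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)

print(round(flesch_kincaid_grade(
    "He was on fire. She really knew how to push his buttons."), 1))
```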
Other reading level indicators, like Lexiles, use the commonness of words as an indicator. Texts are considered to be easier when the words they contain are more common, and more difficult when the words they contain are less common.
Because reading-level metrics are embedded in most grammar checkers, writers are continuously being encouraged to write shorter sentences with fewer, more common words. Writers for news media, advertisers, and politicians, all of whom care deeply about market share, work hard to create texts that meet specific “grade level” requirements. And if we are to judge by analyses of recent political speeches, this has considerably “dumbed down” political messages.
Weaknesses of reading level indicators
Reading level indicators look only at easy-to-measure things like length and frequency. But length and frequency are proxies for what they purport to measure—how easy it is to understand the meaning intended by the author.
Let’s start with word length. Words of the same length or number of syllables can have meanings that are more or less difficult to understand. The word information has 4 syllables and 11 letters. The word validity has 4 syllables and 8 letters. Which concept, information or validity, do you think is easier to understand? (Hint: one concept can’t be understood without a pretty rich understanding of the other.)
How about sentence length? These two sentences express the same meaning. “He was on fire.” “He was so angry that he felt as hot as a fire inside.” In this case, the short sentence is more difficult because it requires the reader to understand that it should be read within a context presented in an earlier sentence—”She really knew how to push his buttons.”
Finally, what about commonness? Well, there are many words that are less common but no more difficult to understand than other words. Take “giant” and “enormous.” The word enormous doesn’t necessarily add meaning; it’s just used less often. It’s not harder, just less popular. And some relatively common words are more difficult to understand than less common words. For example, evolution is a common word with a complex meaning that’s quite difficult to understand, and onerous is an uncommon word that’s relatively easy to understand.
I’m not arguing that reducing sentence and word length and using more common words don’t make prose easier to understand, but metrics that use these proxies don’t actually measure understandability—or at least they don’t do it very well.
How reading level indicators relate to complexity level
When my colleagues and I analyze the complexity level of a text, we’re asking ourselves, “At what level does this person understand these concepts?” We’re looking for meaning, not word length or popularity. Level of complexity directly represents level of understanding.
Reading level indicators do correlate with complexity level. Correlations are generally in the range of .40 to .60, depending on the sample and the reading level indicator. These correlations suggest that 16% to 36% of what reading-level indicators measure is the same thing we measure. In other words, they are weak measures of meaning. They are stronger measures of factors that affect readability but are not directly related to meaning: sentence length, word length, and word commonness.
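For readers who want to see where the 16% and 36% come from: squaring a correlation gives the proportion of variance that two measures share. A quick check:

```python
for r in (0.40, 0.60):
    print(f"r = {r:.2f} -> shared variance = {r * r:.0%}")
# r = 0.40 -> shared variance = 16%
# r = 0.60 -> shared variance = 36%
```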
Here’s an example of how all of this plays out in the real world: The New York Times is said to have a grade 7 Flesch-Kincaid reading level, on average. But complexity analyses of their articles yield scores of 1100-1145. In other words, these articles express meanings that we don’t see in assessment responses until college and beyond. This would explain why the New York Times audience tends to be college educated.
We would say that by reducing sentence and word length, New York Times writers avoid making complex ideas harder to understand.
Reading level indicators are flawed measures of understanding. They are also dinosaurs. When these tools were developed, we couldn’t do any better. But advances in technology, research methods, and the science of learning have taken us beyond proxies for understanding to direct measures of understanding. The next challenge is figuring out how to ensure that these new tools are used responsibly—for the good of all.
Shortly after the President passed the Montreal Cognitive Assessment, a reader emailed with two questions:
Does this mean that the President has the cognitive capacity required of a national leader?
How does a score on this test relate to the complexity level scores you have been describing in recent posts?
A high score on the Montreal Cognitive Assessment does not mean that the President has the cognitive capacity required of a national leader. This test result simply means there is a high probability that the President is not suffering from mild cognitive impairment. (The test has been shown to detect existing cognitive impairment 88% of the time.) In order to determine if the President has the mental capacity to understand the complex issues he faces as a national leader, we need to know how complexly he thinks about those issues.
The answer to the second question is that there is little relation between scores on the Montreal Cognitive Assessment and the complexity level of a person’s thinking. A test like the Montreal Cognitive Assessment does not require the kind of thinking a President needs to understand highly complex issues like climate change or the economy. Teenagers can easily pass this test.
The difference between 1053 and 1137 generally represents a decade or more of sustained learning. (If you’re a new reader and don’t yet know what a complexity level is, check out the National Leaders Series introductory article.)
Tsoi, K. K., Chan, J. Y., Hirai, H. W., Wong, S. Y., & Kwok, T. C. (2015). Cognitive tests to detect dementia: A systematic review and meta-analysis. JAMA Internal Medicine, 175(9), 1450-1458. doi:10.1001/jamainternmed.2015.2152
There’s a battle out there no one’s tweeting about. It involves a tension between statistical significance and practical significance. If you make decisions that involve evaluating evidence—in other words, if you are human—understanding the distinction between these two types of significance will significantly improve your decisions (both practically and statistically).
Statistical significance (a.k.a. “p”) is a calculation made to determine how confident we can be that a relationship between two factors (variables) is real. The lower a p value, the more confident we can be. Most of the time, we want p to be less than .05.
Don’t be misled! A low p value tells us nothing about the size of a relationship between two variables. When someone says that statistical significance is high, all this means is that we can be more confident that the relationship is real.
Once we know we can be confident that a relationship between two variables is real, we should check to see if the research has been replicated. That’s because we can’t be sure a statistically significant relationship found in a single study is really real. After we’ve determined that a relationship is statistically significant and replicable, it’s time to consider practical significance. Practical significance has to do with the size of the relationship.
To figure out how practically significant a relationship is, we need to know how big it is. The size of a relationship, or effect size, is evaluated independently of p. For a plain English discussion of effect size, check out this article, Statistics for all: prediction.
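Here is a small illustration of the difference, using simulated data (the numbers are invented for demonstration, not taken from any real study). With a large enough sample, even a tiny relationship produces a p value far below .05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000                          # a very large sample
x = rng.normal(size=n)
y = 0.03 * x + rng.normal(size=n)    # a real but tiny relationship

r, p = stats.pearsonr(x, y)
print(f"effect size r = {r:.3f}, p = {p:.2g}")
# p is far below .05 (statistically significant), yet r is tiny:
# the relationship is real, but it may not matter in practice.
```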
The greater the size of a relationship between two variables, the more likely the relationship is to be important — but that’s not enough. To have real importance, a relationship must also matter. And it is the decision-maker who decides what matters.
Let’s look at one of my favorite examples. The results of high stakes tests like the SAT and GRE — college entrance exams made by ETS — have been shown to predict college success. Effect sizes tend to be small, but the effects are statistically significant — we can have confidence that they are real. And evidence for these effects has come from numerous studies, so we know they are really real.
If you’re the president of a college, there is little doubt that these test scores have practical significance. Improving prediction of student success, even a little, can have a big impact on the bottom line.
If you’re an employer, you’re more likely to care about how well a student did in college than how they did prior to college, so SAT and GRE scores are likely to be less important to you than college success.
If you’re a student, the size of the effect isn’t important at all. You don’t make the decision about whether or not the school is going to use the SAT or GRE to filter students. Whether or not these assessments are used is out of your control. What’s important to you is how a given college is likely to benefit you.
If you’re me, the size of the effect isn’t very important either. My perspective is that of someone who wants to see major changes in the educational system. I don’t think we’re doing our students any favors by focusing on the kind of learning that can be measured by tests like the GRE and SAT. I think our entire educational system leans toward the wrong goal—transmitting more and more “correct” information. I think we need to ask if what students are learning in school is preparing them for life.
Another thing to consider when evaluating practical significance is whether or not a relationship between two variables tells us only part of a more complex story. For example, the relationship between ethnicity and the rate of developmental growth (what my colleagues and I specialize in measuring) is highly statistically significant (real) and fairly strong (moderate effect size). But this relationship completely disappears once socioeconomic status (wealth) is taken into account. The first relationship is misleading (spurious). The real culprit is poverty. It’s a social problem, not an ethnic problem.
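This pattern is easy to demonstrate with simulated data. In the sketch below, the numbers are entirely invented stand-ins for group membership, socioeconomic status, and growth; the raw correlation looks real, but it shrinks toward zero once the confounder is regressed out:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000
ses = rng.normal(size=n)                              # confounder (socioeconomic status)
group = (ses + rng.normal(size=n) > 0).astype(float)  # group membership tracks SES
outcome = 0.5 * ses + rng.normal(size=n)              # outcome is driven by SES, not group

def residuals(values, control):
    # Remove the linear effect of the control variable.
    slope, intercept = np.polyfit(control, values, 1)
    return values - (slope * control + intercept)

raw_r, _ = stats.pearsonr(group, outcome)
partial_r, _ = stats.pearsonr(residuals(group, ses), residuals(outcome, ses))
print(f"raw correlation: {raw_r:.2f}; after controlling for SES: {partial_r:.2f}")
```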
Most discussions of practical significance stop with effect size. From a statistical perspective, this makes sense. Statistics can’t be used to determine which outcomes matter. People have to do that part, but statistics, when good ones are available, should come first. Here’s my recipe:
My organization, Lectica, Inc., is a 501(c)(3) nonprofit corporation. Part of our mission is to share what we learn with the world. One of the things we’ve learned is that many assessment buyers don’t seem to know enough about statistics to make the best choices. The Statistics for all series is designed to provide assessment buyers with the knowledge they need most to become better assessment shoppers.
In the first post in this series, I promised to share a quick and dirty trick for determining how much confidence you can have in a test score. I will. But first, I want to show you a bit more about what estimating confidence means when it comes to educational and psychological tests.
Let’s start with a look at how test scores are usually reported. The figure below shows three scores, one at level 8, one at level 6, and one at level 4. Looking at this figure, most of us would be inclined to assume that these scores are what they seem to be—precise indicators of the level of a trait or skill.
But this is not the case. Test scores are fuzzy. They’re best understood as ranges rather than as points on a ruler. In other words, test scores are always surrounded by confidence intervals. A person’s true score is likely to fall somewhere in the range described by the confidence interval around a test score.
In order to figure out how fuzzy a test score actually is, you need one thing—an indicator of statistical reliability. Most of the time, this is something called Cronbach’s Alpha. All good test developers publish information about the statistical reliability of their measures, ideally in refereed academic journals with easy-to-find links on their websites! If a test developer won’t provide you with information about Alpha (or its equivalent) for each score reported on a test, it’s best to move on.
The higher the reliability (usually Alpha), the smaller the confidence interval. And the smaller the confidence interval, the more confidence you can have in a test score.
The table below will help to clarify why it is important to know Alpha (or its equivalent). It shows the relationship between Alpha (which can range from 0 to 1.0) and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.
Strata have direct implications for the confidence we can have in a person’s score on a given assessment, because they tell us about the range within which a person’s true score would fall—its confidence interval—given the score awarded.
Imagine that you have just taken a test of emotional intelligence with a score range of 1 to 10 and a reliability of .95. The number of strata into which an assessment with a reliability of .95 can be divided is about 6, which means that each stratum spans roughly 1.7 points on the 10-point scale (10 divided by 6). If your score on this test was 8, your true score would likely fall somewhere between roughly 7.2 and 8.8—your score’s confidence interval.
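If you want to compute these ranges yourself, here is a small sketch. It assumes the strata in the table come from Wright and Masters’ separation-to-strata formula and follows the rule of thumb used above: a true score is likely to fall within about half a stratum of the reported score. The exact figures will differ slightly depending on the formula your test developer uses.

```python
from math import sqrt

def strata(alpha):
    # Wright & Masters' separation-to-strata formula; an assumption here about
    # how a reliability coefficient maps onto statistically distinct levels.
    g = sqrt(alpha / (1 - alpha))
    return (4 * g + 1) / 3

def true_score_range(score, alpha, scale_points=10):
    # Each stratum spans (scale / strata) points; the true score likely lies
    # within half a stratum of the reported score.
    half_width = scale_points / strata(alpha) / 2
    return max(1, score - half_width), min(scale_points, score + half_width)

for alpha in (0.95, 0.85, 0.75):
    low, high = true_score_range(8, alpha)
    print(f"Alpha {alpha:.2f}: about {strata(alpha):.1f} strata; "
          f"a reported 8 means a true score of roughly {low:.1f} to {high:.1f}")
```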
The figure below shows the true score ranges for three test takers, CB, RM, and PR. The fact that these ranges don’t overlap gives us confidence that the emotional intelligence of these test-takers is actually different**.
If these scores were closer together, their confidence intervals would overlap. And if that were the case—for example, if you were comparing two individuals with scores of 8 and 8.5—it would not be correct to say the scores were different from one another. In fact, it would be incorrect for a hiring manager to consider the difference between a score of 8 and a score of 8.5 in making a choice between two job candidates.
By the way, tests with Alphas in the range of .94 or higher are considered suitable for high-stakes use (assuming that they meet other essential validity requirements). What you see in the figure below is about as good as it gets in educational and psychological assessment.
Most assessments used in organizations do not have Alphas that are anywhere near .95. Some of the better assessments have Alphas as high as .85. Let’s take a look at what an Alpha at this level does to confidence intervals.
If the test you have taken has a score range of 1–10 and an Alpha (reliability) of .85, the number of strata into which this assessment can be divided is about 3.4, which means that each stratum equals about 2.9 (10 divided by 3.4) points on the 10 point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6.6 to 9.5*.
In the figure below, note that CB’s true score range now overlaps RM’s true score range and RM’s true score range overlaps PR’s true score range. This means we cannot say—with confidence—that CB’s score is different from RM’s score, or that RM’s score is different from PR’s score.
Assessments with Alphas in the .85 range are suitable for classroom use or low-stakes contexts. Yet, every day, schools and businesses use tests with reliabilities in the .85 range to make high stakes decisions—such as who will be selected for advancement or promotion. And this is often done in a way that would exclude RM (yellow circle) even though his confidence interval overlaps CB’s (teal circle) confidence interval.
Many tests used in organizations have Alphas in the .75 range. If the test you have taken has a score range of 1–10 and an Alpha of .75, the number of strata into which this assessment can be divided is about 2.2, which means that each stratum equals about 4.5 points on the 10 point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6–10*.
As shown in the figure below, scores would now have to differ by at least 4.5 points in order for us to distinguish between two people. CB’s and PR’s scores are different, but RM’s score is uninterpretable.
Tests or sub-scales with alphas in the .75 range are considered suitable for research purposes. Yet, sad to say, schools and businesses now use tests with scales or sub-scales that have Alphas in or below the .75 range, treating these scores as if they provide useful information, when in most cases the scores—like RM’s—are uninterpretable.
If your current test providers are not reporting true score ranges (confidence intervals), ask for them. If they only provide Alphas (reliability statistics) you can use the table and figures in this article to calculate true score ranges for yourself. If you don’t want to do the math, no problem. You can use the figures above to get a feel for how precise a score is.
Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important. I’ll be sharing some tricks for looking at these forms of validity in future articles.
This morning, I received a newsletter from Sir Ken Robinson, a popular motivational speaker who focuses on education. There was a return email address, so I wrote to him. Here's what I wrote:
Dear Sir Ken,
"I love your message. I'm one of the worker bees who's trying to leverage the kind of changes you envision.
After 20+ years of hard work, my colleagues and I have reinvented educational assessment. No multiple choice. No high stakes. Our focus is on assessment for learning—supporting students in learning joyfully and deeply in a way that facilitates skills for learning, thinking, inquiring, relating and otherwise navigating a complex world. Our assessments are scalable and standardized, but they do not homogenize. They are grounded in a deep study of the many pathways through which students learn key skills and concepts. We're documenting, in exquisite (some would say insane) detail, how concepts and skills develop over time so we can gain insight into learners' knowledge networks. We don't ask about correctness. We ask about understanding and competence and how they develop over time. And we help teachers meet students "where they're at."
We've accumulated a strong base of evidence to support these claims. But now that we're ready to scale, we're running up against hostility toward all standardized assessment. It's difficult to get to the point where we can even have a conversation with our pedagogical allies. Ouch!
Lectica is organized as a nonprofit so we can guarantee that the underprivileged are served first. We plan to offer subscriptions to our assessments (learning tools) without charge to individual teachers everywhere.
We've kept our heads down as we've developed our methods and technology. Now we're scaling and want to be seen. We know we're part of the solution to today's educational crisis—perhaps a very big part of the solution. I'm hoping you'd like to learn more."
My email was returned with this message: "The email account that you tried to reach does not exist." How frustrating.
So, I thought I'd pen this post and ask my friends and colleagues to help me get access to Sir Ken's ear. If you know him, please forward this message. I'm certain he'll be interested in what we're doing for learning and development. Where are you Sir Ken Robinson? Can you hear me? Are you out there?
I’ve been auditing a very popular 4.5-star Coursera course called “Learning how to learn.” It uses all of the latest research to help people improve their “learning skills.” Yet, even though the lectures in the course are interesting and the research behind the course appears to be sound, I find it difficult to agree that it is a course that helps people learn how to learn.
First, the tests used to determine how well participants have built the learning skills described in this course are actually tests of how well they have learned vocabulary and definitions. As far as I can tell, no skills are involved other than the ability to recall course content. This is problematic. The assumption that learning vocabulary and definitions builds skill is unwarranted. I believe we all know this. Who has not had the experience of learning something well enough to pass a test only to forget most of what they had learned shortly thereafter?
Second, the content in the tests at the end of the videos isn’t particularly relevant to the stated intention of the course. These tests require remembering (or scrolling back to) facts like “Many new synapses are formed on dendrites.” We do not need to learn this to become effective learners. The test item for which this is the correct answer is focused on an aspect of how learning works rather than on how to learn. And although understanding how learning works might be a step toward learning how to learn, answering this question correctly doesn’t tell us how the participant understands anything at all.
Third, if the course developers had used tests of skill (tests that asked participants to show off how effectively they could apply the techniques described), we would be able to ask about the extent to which the course helps participants learn how to learn. Instead, the only way we have to evaluate the effectiveness of the course is through participant ratings and comments—how much people like it. I’m not suggesting that liking a course is unimportant, but it’s not a good way to evaluate its effectiveness.
Fourth, the course seems to be primarily concerned with fostering a kind of learning that helps people do better on tests of correctness. The underlying and unstated assumption seems to be that if you can do better on these tests, you have learned better. This assumption flies in the face of several decades of educational research, including our own [for example, 1, 2, 3]. Correctness is not adequate evidence of understanding or real-world skill. If we want to know how well people understand new knowledge, we must observe how they apply this knowledge in real-world contexts. If we want to evaluate their level of skill, we must observe how well they apply the skill in real-world contexts. In other words, a course in learning how to learn—especially a course in learning how to learn—should be building useable skills that have value beyond the act of passing a test of correctness.
Fifth, the research behind this course can help us understand how learning works. At Lectica, we’ve used the very same information as part of the basis for our learning model, VCoL+7. But instead of using this knowledge to support the status quo—an educational system that privileges correctness over understanding and skill—we’re using it to build learning tools designed to ensure that learning in school goes beyond correctness to build deep understanding and robust skill.
For the vast majority of people, schooling is not an end in itself. It is preparation for life—preparation with tomorrow’s skills. It’s time we held our educational institutions accountable for ensuring that students know how to learn more than correct answers. Wherever their lives take them, they will do better if equipped with understanding and skill. Correctness is not enough.
FairTest; Mulholland, Q. (2015). The case against standardized testing. Harvard Political Review, May 14.
Schwartz, M. S., Sadler, P. M., Sonnert, G., & Tai, R. H. (2009). Depth versus breadth: How content coverage in high school science courses relates to later success in college science coursework. Science Education, 93(5), 798-826.
During the '70s and '80s I practiced midwifery. It was a great honor to be present at the births of over 500 babies and, in many cases, to follow them into childhood. Every single one of those babies was a joyful, driven, and effective "every moment" learner. Regardless of difficulty and pain, they all learned to walk, talk, interact with others, and manipulate many aspects of their environment. They needed few external rewards to build these skills—the excitement and suspense of striving seemed to be reward enough. I felt like I was observing the "life force" in action.
Unfortunately, as many of these children approached the third grade (age 8), I noticed something else—something deeply troubling. Many of the same children seemed to have lost much of this intrinsic drive to learn. For them, learning had become a chore motivated primarily by extrinsic rewards and punishments. Because this was happening primarily to children attending conventional schools (children receiving alternative instruction seemed to be exempt), it appeared that something about schooling was depriving many children of the fundamental human drive required to support a lifetime of learning and development—a drive that looked to me like a key source of happiness and fulfillment.
Understanding the problem
After my midwifery career, I flirted briefly with a career in advertising, but by the early '90s I was back in school—in a Ph.D. program at U.C. Berkeley's Graduate School of Education—where I found myself observing the same pattern I'd observed as a midwife. Both the research and my own lab experience exposed the early loss of students' natural love of learning. My concern was only increased by the newly emerging trend toward high stakes multiple choice testing, which my colleagues and I saw as a further threat to children's natural drive to learn.
Most of the people I've spoken to about this problem have agreed that it's a shame, but few have seen it as a problem that can be solved, and many have seen it as an inevitable consequence of either mass schooling or simple maturation. But I knew it was not inevitable. Children educated in a range of alternative environments did not appear to lose their drive to learn. Additionally, above-average students in conventional schools appeared to be more likely to retain their love of learning.
I set out to find out why—and ended up on a long journey toward a solution.
How learning works
First, I needed to understand how learning works. At Berkeley, I studied a wide variety of learning theories in several disciplines, including developmental theories, behavioral theories, and brain-based theories. I collected a large database of longitudinal interviews and submitted them to in-depth analysis, looked closely at the relation between testing and learning, and studied psychological measurement, all in the interest of finding a way to support children's growth while reinforcing their love of learning.
My dissertation—which won awards from both U.C. Berkeley and the American Psychological Association—focused on the development of people's conceptions of learning from age 5 through 85, and how this kind of knowledge could be used to measure and support learning. In 1998, I received $500,000 from the Spencer Foundation to further develop the methods designed for this research. Some of my areas of expertise are human learning and development, psychometrics, metacognition, moral education, and research methods.
In the simplest possible terms, what I learned in 5 years of graduate school is that the human brain is designed to drive learning, and that preserving that natural drive requires 5 ingredients:
a safe environment that is rich in learning opportunities and healthy human interaction,
a teacher who understands each child's interests and level of tolerance for failure,
a mechanism for determining "what comes next"—what is just challenging enough to allow for success most of the time (but not all of the time),
instant actionable feedback, and
the opportunity to integrate new knowledge or skills into each learner's existing knowledge network well enough to make it usable before pushing instruction to the next level. (We call this building a "robust knowledge network"—the essential foundation for future learning.)*
Identifying the solution
Once we understood what learning should look like, we needed to decide where to intervene. The answer, when it came, was a complete surprise. Understanding what comes next—something that can only be learned by measuring what a student understands now—was an integral part of the recipe for learning. This meant that testing, which we originally saw as an obstacle to robust learning, was actually the solution, but only if we could build tests that would free students to learn the way their brains are designed to learn. These tests would have to help teachers determine "what comes next" (ingredient 3) and provide instant actionable feedback (ingredient 4), while rewarding them for helping students build robust knowledge networks (ingredient 5).
Unfortunately, conventional standardized tests were focused on "correctness" rather than robust learning, and none of them were based on the study of how targeted concepts and skills develop over time. Moreover, they were designed not to support learning, but rather to make decisions about advancement or placement, based on how many correct answers students were able to provide relative to other students. Because this form of testing did not meet the requirements of our learning recipe, we'd have to start from scratch.
Developing the solution
We knew that our solution—reinventing educational testing to serve robust learning—would require many years of research. In fact, we would be committing to possible decades of effort without a guaranteed result. It was the vision of a future educational system in which all children retained their inborn drive for learning that ultimately compelled us to move forward.
To reinvent educational testing, we needed to:
make a deep study of precisely how children build particular knowledge and skills over time in a wide range of subject areas (so these tests could accurately identify "what comes next");
make tests that determine how deeply students understand what they have learned—how well they can use it to address real-world issues or problems (requires that students show how they are thinking, not just what they know—which means written responses with explanations); and
produce formative feedback and resources designed to foster "robust learning" (build robust knowledge networks).
Here's what we had to invent:
A learning ruler (building on Commons and Fischer);
A method for studying how students learn tested concepts and skills (refining the methods developed for my dissertation);
A human scoring system for determining the level of understanding exhibited in students' written explanations (building upon Commons' and Fischer's methods, refining them until measurements were precise enough for use in educational contexts); and
An electronic scoring system, so feedback and resources could be delivered in real time.
It took over 20 years (1996–2016), but we did it! And while we were doing it, we conducted research. In fact, our assessments have been used in dozens of research projects, including a $25 million study of literacy conducted at Harvard, and numerous Ph.D. dissertations—with more on the way.
What we've learned
We've learned many things from this research. Here are some that took us by surprise:
Students in schools that focus on building deep understanding graduate seniors who are up to 5 years ahead (on our learning ruler) of students in schools that focus on correctness (2.5 to 3 years after taking socioeconomic status into account).
Students in schools that foster robust learning develop faster and continue to develop longer (into adulthood) than students in schools that focus on correctness.
On average, students in schools that foster robust learning produce more coherent and persuasive arguments than students in schools that focus on correctness.
On average, students in our inner-city schools, which are the schools most focused on correctness, stop developing (on our learning ruler) in grade 10.
The average student who graduates from a school that strongly focuses on correctness is likely, in adulthood, to (1) be unable to grasp the complexity and ambiguity of many common situations and problems, (2) lack the mental agility to adapt to changes in society and the workplace, and (3) dislike learning.
From our perspective, these results point to an educational crisis that can best be addressed by allowing students to learn as their brains were designed to learn. Practically speaking, this means providing learners, parents, teachers, and schools with metrics that reward and support teaching that fosters robust learning.
Where we are today
Lectica has created the only metrics that meet all of these requirements. Our mission is to foster greater individual happiness and fulfillment while preparing students to meet 21st century challenges. We do this by creating and delivering learning tools that encourage students to learn the way their brains were designed to learn. And we ensure that students who need our learning tools the most get them first by providing free subscriptions to individual teachers everywhere.
To realize our mission, we organized as a nonprofit. We knew this choice would slow our progress (relative to organizing as a for-profit and welcoming investors), but it was the only way to guarantee that our true mission would not be derailed by other interests.
Thus far, we've funded ourselves with work in the for-profit sector and income from grants. Our background research is rich, our methods are well-established, and our technology works even better than we thought it would. Last fall, we completed a demonstration of our electronic scoring system, CLAS, a novel technology that learns from every single assessment taken in our system.
The groundwork has been laid, and we're ready to scale. All we need is the platform that will deliver the assessments (called DiscoTests), several of which are already in production.
After 20 years of high stakes testing, students and teachers need our solution more than ever. We feel compelled to scale as quickly as possible, so we can begin the process of reinvigorating today's students' natural love of learning and ensuring that the next generation of students never loses theirs. Lectica's story isn't finished. Instead, we find ourselves on the cusp of a new beginning!
A final note: There are many benefits associated with our approach to assessment that were not mentioned here. For example, because the assessment scores are all calibrated to the same learning ruler, students, teachers, and parents can easily track student growth. Even better, our assessments are designed to be taken frequently and to be embedded in low-stakes contexts. For grading purposes, teachers are encouraged to focus on growth over time rather than specific test scores. This way of using assessments pretty much eliminates concerns about cheating. And finally, the electronic scoring system we developed is backed by the world's first "taxonomy of learning," which also serves many other educational and research functions. It's already spawned a developmentally sensitive spell-checker! One day, this taxonomy of learning will be robust enough to empower teachers to create their own formative assessments on the fly.
Adaptive learning technologies are touted as an advance in education and a harbinger of what's to come. But although we at Lectica agree that adaptive learning has a great deal to offer, we have some concerns about its current limitations. In an earlier article, I raised the question of how well one of these platforms, Knewton, serves "robust learning"—the kind of learning that leads to deep understanding and usable knowledge. Here are some more general observations.
The great strength of adaptive learning technologies is that they allow students to learn at their own pace. That's big. It's quite enough to be excited about, even if it changes nothing else about how people learn. But in our excitement about this advance, the educational community is in danger of ignoring important shortcomings of these technologies.
First, adaptive learning technologies are built on adaptive testing technologies. Today, these testing technologies are focused on "correctness." Students are moved to the next level of difficulty based on their ability to get correct answers. This is what today's testing technologies measure best. However, although being able to produce or select correct answers is important, it is not an adequate indication of understanding. And without real understanding, knowledge is not usable and can't be built upon effectively over the long term.
Second, today's adaptive learning technologies are focused on a narrow range of content—the kind of content psychometricians know how to build tests for—mostly math and science (with an awkward nod to literacy). In public education during the last 20 years, we've experienced a gradual narrowing of the curriculum, largely because of high stakes testing and its narrow focus. Today's adaptive learning technologies suffer from the same limitations and are likely to reinforce this trend.
Third, the success of adaptive learning technologies is measured with standardized tests of correctness. Higher scores will help more students get into college—after all, colleges use these tests to decide who will be admitted. But we have no idea how well higher scores on these tests translate into life success. Efforts to demonstrate the relevance of educational practices are few and far between. And notably, there are many examples of highly successful individuals who were poor players in the education game—including several of the world's most productive and influential people.
Fourth, some proponents of online adaptive learning believe that it can and should replace (or marginalize) teachers and classrooms. This is concerning. Education is more than a process of accumulating facts. For one thing, it plays an enormous role in socialization. Good teachers and classrooms offer students opportunities to build knowledge while learning how to engage and work with diverse others. Great teachers catalyze optimal learning and engagement by leveraging students' interests, knowledge, skills, and dispositions. They also encourage students to put what they're learning to work in everyday life—both on their own and in collaboration with others.
Lectica has a strong interest in adaptive learning and the technologies that deliver it. We anticipate that over the next few years, our assessment technology will be integrated into adaptive learning platforms to help expand their subject matter and ensure that students are building robust, usable knowledge. We will also be working hard to ensure that these platforms are part of a well-thought-out, evidence-based approach to education—one that fosters the development of tomorrow's skills—the full range of skills and knowledge required for success in a complex and rapidly changing world.
We've been hearing quite a bit about the "proficiency vs. growth" debate since Betsy DeVos (Trump's candidate for Education Secretary) was asked to weigh in last week. This debate involves a disagreement about how high stakes tests should be used to evaluate educational programs. Advocates for proficiency want to reward schools when their students score higher on state tests. Advocates for growth want to reward schools when their students grow more on state tests. Readers who know about Lectica's work can guess where we'd land in this debate—we're outspokenly growth-minded.
For us, however, the proficiency vs. growth debate is only a tiny piece of a broader issue about what counts as learning. Here's a sketch of the situation as we see it:
Getting a higher score on a state test means that you can get more correct answers on increasingly difficult questions, or that you can more accurately apply writing conventions or decode texts. But these aren't the things we really want to measure. They're "proxies"—approximations of our real learning objectives. Test developers measure proxies because they don't know how to measure what we really want to know.
What we really want to know is how well we're preparing students with the skills and knowledge they'll need to successfully navigate life and work.
Scores on conventional tests predict how well students are likely to perform, in the future, on conventional tests. But scores on these tests have not been shown to be good predictors of success in life.*
In light of this glaring problem with conventional tests, the debate between proficiency and growth is a bit of a red herring. What we really need to be asking ourselves is a far more fundamental question:
What knowledge and skills will our children need to navigate the world of tomorrow, and how can we best nurture their development?
That's the question that frames our work here at Lectica.
*For information about the many problems with conventional tests, see FairTest.