Flesch-Kincaid and other reading level metrics are sometimes employed to compare the arguments made by politicians in their speeches, interviews, and writings. What are these metrics, and what do they actually tell us about these verbal performances?
Flesch-Kincaid looks at sentence length, word length, and syllable count. Texts are considered “harder” when they have longer sentences and use words with more letters, and “easier” when they have shorter sentences and use words with fewer letters. For decades, Flesch-Kincaid and other reading level metrics have been built into word processors. When a grammar checker advises you that the reading level of your article is too high, that warning is likely based on word and sentence length.
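The Flesch-Kincaid grade-level formula itself is public and simple to compute. Here is a minimal sketch in Python; the counts are supplied directly, since automatic syllable counting is the messy part in practice, and the passage numbers are invented for illustration:

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level, computed from raw counts.

    Published formula: 0.39 * (words per sentence)
                     + 11.8 * (syllables per word) - 15.59
    """
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# A hypothetical 100-word passage with 5 sentences and 130 syllables:
print(round(flesch_kincaid_grade(100, 5, 130), 2))  # 7.55, roughly "grade 7.5"
```

Note that nothing in the formula touches meaning; splitting one long sentence into two lowers the grade even if the ideas are unchanged.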
Other reading level indicators, like Lexiles, use the commonness of words as an indicator. Texts are considered to be easier when the words they contain are more common, and more difficult when the words they contain are less common.
Because reading-level metrics are embedded in most grammar checkers, writers are continuously being encouraged to write shorter sentences with fewer, more common words. Writers for news media, advertisers, and politicians, all of whom care deeply about market share, work hard to create texts that meet specific “grade level” requirements. And if we are to judge by analyses of recent political speeches, this has considerably “dumbed down” political messages.
Weaknesses of reading level indicators
Reading level indicators look only at easy-to-measure things like length and frequency. But length and frequency are proxies for what they purport to measure—how easy it is to understand the meaning intended by the author.
Let’s start with word length. Words of the same length or number of syllables can have meanings that are more or less difficult to understand. The word information has 4 syllables and 11 letters. The word validity has 4 syllables and 8 letters. Which concept, information or validity, do you think is easier to understand? (Hint: one concept can’t be understood without a pretty rich understanding of the other.)
How about sentence length? These two sentences express the same meaning. “He was on fire.” “He was so angry that he felt as hot as a fire inside.” In this case, the short sentence is more difficult because it requires the reader to understand that it should be read within a context presented in an earlier sentence—”She really knew how to push his buttons.”
Finally, what about commonness? Well, there are many words that are less common but no more difficult to understand than other words. Take “giant” and “enormous.” The word enormous doesn’t necessarily add meaning; it’s just used less often. It’s not harder, just less popular. And some relatively common words are more difficult to understand than less common words. For example, evolution is a common word with a complex meaning that’s quite difficult to understand, while onerous is an uncommon word that’s relatively easy to understand.
I’m not arguing that reducing sentence and word length and using more common words don’t make prose easier to understand, but metrics that use these proxies don’t actually measure understandability—or at least they don’t do it very well.
How reading level indicators relate to complexity level
When my colleagues and I analyze the complexity level of a text, we’re asking ourselves, “At what level does this person understand these concepts?” We’re looking for meaning, not word length or popularity. Level of complexity directly represents level of understanding.
Reading level indicators do correlate with complexity level. Correlations are generally within the range of .40 to .60, depending on the sample and reading level indicator. These are strong enough correlations to suggest that 16% to 36% of what reading-level indicators measure is the same thing we measure. In other words, they are weak measures of meaning. They are stronger measures of factors that impact readability, but are not related directly to meaning—sentence and word length and/or commonness.
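The “16% to 36%” figure comes from squaring the correlation coefficient, which gives the proportion of variance two measures share (the coefficient of determination, r²):

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures share: the square of their correlation."""
    return r ** 2

# Correlations of .40 and .60 imply 16% and 36% shared variance:
for r in (0.40, 0.60):
    print(f"r = {r:.2f} -> {shared_variance(r):.0%} shared variance")
```

This is why a moderate-looking correlation can still leave most of the variation in one measure unexplained by the other.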
Here’s an example of how all of this plays out in the real world: The New York Times is said to have a grade 7 Flesch-Kincaid reading level, on average. But complexity analyses of their articles yield scores of 1100–1145. In other words, these articles express meanings that we don’t see in assessment responses until college and beyond. This would explain why the New York Times audience tends to be college educated.
We would say that by reducing sentence and word length, New York Times writers avoid making complex ideas harder to understand.
Reading level indicators are flawed measures of understanding. They are also dinosaurs. When these tools were developed, we couldn’t do any better. But advances in technology, research methods, and the science of learning have taken us beyond proxies for understanding to direct measures of understanding. The next challenge is figuring out how to ensure that these new tools are used responsibly—for the good of all.
There’s a battle out there no one’s tweeting about. It involves a tension between statistical significance and practical significance. If you make decisions that involve evaluating evidence—in other words, if you are human—understanding the distinction between these two types of significance will significantly improve your decisions (both practically and statistically).
Statistical significance (a.k.a. “p”) is a calculation made to determine how confident we can be that a relationship between two factors (variables) is real. The lower a p value, the more confident we can be. Most of the time, we want p to be less than .05.
Don’t be misled! A low p value tells us nothing about the size of a relationship between two variables. When someone says that statistical significance is high, all this means is that we can be more confident that the relationship is real.
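To see why a low p value says nothing about size, it helps to remember that the test statistic behind p grows with sample size. For a Pearson correlation, the standard test statistic is t = r·sqrt((n − 2)/(1 − r²)). The sample sizes and correlations below are invented for illustration:

```python
import math

def t_from_r(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# A tiny effect in a huge sample...
print(round(t_from_r(0.05, 10_000), 2))  # 5.01 -> p far below .001
# ...versus a large effect in a small sample.
print(round(t_from_r(0.50, 20), 2))      # 2.45 -> p around .02
```

The tiny effect is “more significant” than the large one, which is exactly why statistical significance cannot be read as importance.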
Once we know we can be confident that a relationship between two variables is real, we should check to see if the research has been replicated. That’s because we can’t be sure a statistically significant relationship found in a single study is really real. After we’ve determined that a relationship is statistically significant and replicable, it’s time to consider practical significance. Practical significance has to do with the size of the relationship.
To figure out how practically significant a relationship is, we need to know how big it is. The size of a relationship, or effect size, is evaluated independently of p. For a plain English discussion of effect size, check out this article, Statistics for all: prediction.
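A common effect size for comparing two groups is Cohen’s d, the difference between group means expressed in pooled standard deviation units. A minimal sketch with invented data:

```python
import math
import statistics

def cohens_d(group_a: list, group_b: list) -> float:
    """Cohen's d: the difference between two group means in
    pooled standard deviation units (a standard effect size)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)

# Hypothetical scores for two small groups:
print(round(cohens_d([6, 7, 8, 9, 10], [4, 5, 6, 7, 8]), 2))  # 1.26
```

Unlike p, this number does not shrink or grow just because you collected more data; it describes the size of the difference itself.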
The greater the size of a relationship between two variables, the more likely the relationship is to be important — but that’s not enough. To have real importance, a relationship must also matter. And it is the decision-maker who decides what matters.
Let’s look at one of my favorite examples. The results of high stakes admissions tests like the SAT and GRE have been shown to predict college success. Effect sizes tend to be small, but the effects are statistically significant—we can have confidence that they are real. And evidence for these effects has come from numerous studies, so we know they are really real.
If you’re the president of a college, there is little doubt that these test scores have practical significance. Improving prediction of student success, even a little, can have a big impact on the bottom line.
If you’re an employer, you’re more likely to care about how well a student did in college than how they did prior to college, so SAT and GRE scores are likely to be less important to you than college success.
If you’re a student, the size of the effect isn’t important at all. You don’t make the decision about whether or not the school is going to use the SAT or GRE to filter students. Whether or not these assessments are used is out of your control. What’s important to you is how a given college is likely to benefit you.
If you’re me, the size of the effect isn’t very important either. My perspective is that of someone who wants to see major changes in the educational system. I don’t think we’re doing our students any favors by focusing on the kind of learning that can be measured by tests like the GRE and SAT. I think our entire educational system leans toward the wrong goal—transmitting more and more “correct” information. I think we need to ask if what students are learning in school is preparing them for life.
Another thing to consider when evaluating practical significance is whether or not a relationship between two variables tells us only part of a more complex story. For example, the relationship between ethnicity and the rate of developmental growth (what my colleagues and I specialize in measuring) is highly statistically significant (real) and fairly strong (moderate effect size). But, this relationship completely disappears once socioeconomic status (wealth) is taken into account. The first relationship is misleading (spurious). The real culprit is poverty. It’s a social problem, not an ethnic problem.
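One standard way to check whether a third variable is carrying a relationship is a first-order partial correlation, which removes the influence of that variable. The correlation values below are purely hypothetical, chosen so the raw relationship vanishes entirely once the confound is controlled (they are not Lectica’s actual numbers):

```python
import math

def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation: the x-y relationship
    after controlling for a third variable z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# A raw x-y correlation of .35 that is entirely carried by a confound z:
print(round(partial_correlation(r_xy=0.35, r_xz=0.70, r_yz=0.50), 2))  # 0.0
```

When the partial correlation collapses toward zero like this, the original relationship is spurious in exactly the sense described above.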
Most discussions of practical significance stop with effect size. From a statistical perspective, this makes sense. Statistics can’t be used to determine which outcomes matter. People have to do that part, but statistics, when good ones are available, should come first. Here’s my recipe:
My organization, Lectica, Inc., is a 501(c)(3) nonprofit corporation. Part of our mission is to share what we learn with the world. One of the things we’ve learned is that many assessment buyers don’t seem to know enough about statistics to make the best choices. The Statistics for all series is designed to provide assessment buyers with the knowledge they need most to become better assessment shoppers.
In the first post in this series, I promised to share a quick and dirty trick for determining how much confidence you can have in a test score. I will. But first, I want to show you a bit more about what estimating confidence means when it comes to educational and psychological tests.
Let’s start with a look at how test scores are usually reported. The figure below shows three scores, one at level 8, one at level 6, and one at level 4. Looking at this figure, most of us would be inclined to assume that these scores are what they seem to be—precise indicators of the level of a trait or skill.
But this is not the case. Test scores are fuzzy. They’re best understood as ranges rather than as points on a ruler. In other words, test scores are always surrounded by confidence intervals. A person’s true score is likely to fall somewhere in the range described by the confidence interval around a test score.
In order to figure out how fuzzy a test score actually is, you need one thing—an indicator of statistical reliability. Most of the time, this is something called Cronbach’s Alpha. All good test developers publish information about the statistical reliability of their measures, ideally in refereed academic journals with easy-to-find links on their web sites! If a test developer won’t provide you with information about Alpha (or its equivalent) for each score reported on a test, it’s best to move on.
The higher the reliability (usually Alpha) the smaller the confidence interval. And the smaller the confidence interval, the more confidence you can have in a test score.
The table below will help to clarify why it is important to know Alpha (or its equivalent). It shows the relationship between Alpha (which can range from 0 to 1.0) and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.
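One widely used way to turn a reliability coefficient into a number of strata is Wright and Masters’ separation formula from the Rasch measurement literature: strata = (4G + 1)/3, where G = sqrt(reliability / (1 − reliability)). This formula is my assumption; the article’s table may be computed somewhat differently, and the two agree only approximately at some reliabilities:

```python
import math

def separation_strata(reliability: float) -> float:
    """Wright & Masters' strata index: the number of statistically
    distinct performance levels a scale supports at a given reliability."""
    g = math.sqrt(reliability / (1 - reliability))  # separation index G
    return (4 * g + 1) / 3

print(round(separation_strata(0.80), 1))  # 3.0, matching the "3 strata" example
print(round(separation_strata(0.95), 1))  # 6.1, close to the "about 6" used below
```

Either way, the qualitative pattern is the same: strata rise slowly with reliability, so small gains in Alpha near the top of the range buy disproportionately more distinct levels.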
Strata have direct implications for the confidence we can have in a person’s score on a given assessment, because they tell us about the range within which a person’s true score would fall—its confidence interval—given the score awarded.
Imagine that you have just taken a test of emotional intelligence with a score range of 1 to 10 and a reliability of .95. An assessment with a reliability of .95 can be divided into about 6 strata, which means that each stratum equals about 1.7 points on the 10 point scale (10 divided by 6). If your score on this test was 8, your true score would likely be somewhere between about 7.2 and 8.8—your score’s confidence interval.
The figure below shows the true score ranges for three test takers, CB, RM, and PR. The fact that these ranges don’t overlap gives us confidence that the emotional intelligence of these test-takers is actually different.
If these scores were closer together, their confidence intervals would overlap. And if that were the case—for example, if you were comparing two individuals with scores of 8 and 8.5—it would not be correct to say the scores were different from one another. In fact, it would be incorrect for a hiring manager to consider the difference between a score of 8 and a score of 8.5 in making a choice between two job candidates.
By the way, tests with Alphas in the range of .94 or higher are considered suitable for high-stakes use (assuming that they meet other essential validity requirements). What you see in the figure below is about as good as it gets in educational and psychological assessment.
Most assessments used in organizations do not have Alphas that are anywhere near .95. Some of the better assessments have Alphas as high as .85. Let’s take a look at what an Alpha at this level does to confidence intervals.
If the test you have taken has a score range of 1–10 and an Alpha (reliability) of .85, it can be divided into about 3.4 strata, which means that each stratum equals about 2.9 points (10 divided by 3.4) on the 10 point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6.6 to 9.5*.
In the figure below, note that CB’s true score range now overlaps RM’s true score range and RM’s true score range overlaps PR’s true score range. This means we cannot say—with confidence—that CB’s score is different from RM’s score, or that RM’s score is different from PR’s score.
Assessments with Alphas in the .85 range are suitable for classroom use or low-stakes contexts. Yet, every day, schools and businesses use tests with reliabilities in the .85 range to make high stakes decisions—such as who will be selected for advancement or promotion. And this is often done in a way that would exclude RM (yellow circle) even though his confidence interval overlaps CB’s (teal circle) confidence interval.
Many tests used in organizations have Alphas in the .75 range. If the test you have taken has a score range of 1–10 and an Alpha of .75, it can be divided into about 2.2 strata, which means that each stratum equals about 4.5 points on the 10 point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6–10*.
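The arithmetic used in these three scenarios can be wrapped in one small helper. This mirrors the article’s back-of-the-envelope method (stratum width = 10 points divided by the number of strata; true score = reported score plus or minus half a stratum, clamped to the ends of the scale); results may differ from the article’s figures in the last decimal because of rounding:

```python
def true_score_range(score, strata, scale_points=10, low=1, high=10):
    """Approximate true-score range: score +/- half of one stratum,
    clamped to the ends of the scale."""
    half = (scale_points / strata) / 2
    return (max(low, round(score - half, 1)), min(high, round(score + half, 1)))

print(true_score_range(8, strata=6.0))  # (7.2, 8.8)  Alpha ~ .95
print(true_score_range(8, strata=3.4))  # (6.5, 9.5)  Alpha ~ .85
print(true_score_range(8, strata=2.2))  # (5.7, 10)   Alpha ~ .75
```

The widening intervals make the practical point vividly: at Alpha = .75, a score of 8 is compatible with true scores covering nearly half the scale.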
As shown in the figure below, scores would now have to differ by at least 4.5 points in order for us to distinguish between two people. CB’s and PR’s scores are different, but RM’s score is uninterpretable.
Tests or sub-scales with Alphas in the .75 range are considered suitable for research purposes. Yet, sad to say, schools and businesses now use tests with scales or sub-scales that have Alphas in or below the .75 range, treating these scores as if they provide useful information, when in most cases the scores—like RM’s—are uninterpretable.
If your current test providers are not reporting true score ranges (confidence intervals), ask for them. If they only provide Alphas (reliability statistics) you can use the table and figures in this article to calculate true score ranges for yourself. If you don’t want to do the math, no problem. You can use the figures above to get a feel for how precise a score is.
Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important. I’ll be sharing some tricks for looking at these forms of validity in future articles.
In a recent blog post—actually in several recent blog posts—I've been emphasizing the importance of building tomorrow's skills. These are the kinds of skills we all need to navigate our increasingly complex and changing world. While I may not agree that all of the top 10 skills listed in the World Economic Forum report (shown above) belong in a list of skills (creativity is much more than a skill, and service orientation is more of a disposition than a skill), the flavor of this list is generally in sync with the kinds of skills, dispositions, and behaviors required in a complex and rapidly changing world.
The "skills" in this list cannot be developed in learning environments focused primarily on correctness, or in workplace environments that don't allow for mistakes.
These "skills" are best developed through cycles of goal setting, information gathering, application, and reflection—what we call virtuous cycles of learning—or VCoLs. And they're best assessed with tests that focus on applications of skill in real-world contexts, like Lectical Assessments, which are based on a rich research tradition focused on the development of understanding and skill.
I’ve been auditing a very popular 4.5 star Coursera course called “Learning how to learn.” It uses all of the latest research to help people improve their “learning skills.” Yet, even though the lectures in the course are interesting and the research behind the course appears to be sound, I find it difficult to agree that it is a course that helps people learn how to learn.
First, the tests used to determine how well participants have built the learning skills described in this course are actually tests of how well they have learned vocabulary and definitions. As far as I can tell, no skills are involved other than the ability to recall course content. This is problematic. The assumption that learning vocabulary and definitions builds skill is unwarranted. I believe we all know this. Who has not had the experience of learning something well enough to pass a test only to forget most of what they had learned shortly thereafter?
Second, the content of the tests at the end of the videos isn’t particularly relevant to the stated intention of the course. These tests require remembering (or scrolling back to) facts like “Many new synapses are formed on dendrites.” We do not need to learn this to become effective learners. The test item for which this is the correct answer is focused on an aspect of how learning works rather than how to learn. And although understanding how learning works might be a step toward learning how to learn, answering this question correctly doesn’t tell us anything about what the participant actually understands.
Third, if the course developers had used tests of skill—tests that asked participants to show how effectively they could apply the described techniques—we would be able to ask about the extent to which the course helps participants learn how to learn. Instead, the only way we have to evaluate the effectiveness of the course is through participant ratings and comments—how much people like it. I’m not suggesting that liking a course is unimportant, but it’s not a good way to evaluate its effectiveness.
Fourth, the course seems to be primarily concerned with fostering a kind of learning that helps people do better on tests of correctness. The underlying and unstated assumption seems to be that if you can do better on these tests, you have learned better. This assumption flies in the face of several decades of educational research, including our own [for example, 1, 2, 3]. Correctness is not adequate evidence of understanding or real-world skill. If we want to know how well people understand new knowledge, we must observe how they apply this knowledge in real-world contexts. If we want to evaluate their level of skill, we must observe how well they apply the skill in real-world contexts. In other words, a course in learning how to learn—especially a course in learning how to learn—should be building useable skills that have value beyond the act of passing a test of correctness.
Fifth, the research behind this course can help us understand how learning works. At Lectica, we’ve used the very same information as part of the basis for our learning model, VCoL+7. But instead of using this knowledge to support the status quo—an educational system that privileges correctness over understanding and skill—we’re using it to build learning tools designed to ensure that learning in school goes beyond correctness to build deep understanding and robust skill.
For the vast majority of people, schooling is not an end in itself. It is preparation for life—preparation with tomorrow’s skills. It’s time we held our educational institutions accountable for ensuring that students know how to learn more than correct answers. Wherever their lives take them, they will do better if equipped with understanding and skill. Correctness is not enough.
1. FairTest; Mulholland, Quinn (2015). The case against standardized testing. Harvard Political Review, May 14.
2. Schwartz, M. S., Sadler, P. M., Sonnert, G., & Tai, R. H. (2009). Depth versus breadth: How content coverage in high school science courses relates to later success in college science coursework. Science Education, 93(5), 798–826.
During the '70s and '80s, I practiced midwifery. It was a great honor to be present at the births of over 500 babies and, in many cases, to follow them into childhood. Every single one of those babies was a joyful, driven, and effective "every moment" learner. Regardless of difficulty and pain, they all learned to walk, talk, interact with others, and manipulate many aspects of their environment. They needed few external rewards to build these skills—the excitement and suspense of striving seemed to be reward enough. I felt like I was observing the "life force" in action.
Unfortunately, as many of these children approached the third grade (age 8), I noticed something else—something deeply troubling. Many of the same children seemed to have lost much of this intrinsic drive to learn. For them, learning had become a chore motivated primarily by extrinsic rewards and punishments. Because this was happening primarily to children attending conventional schools (children receiving alternative instruction seemed to be exempt), it appeared that something about schooling was depriving many children of the fundamental human drive required to support a lifetime of learning and development—a drive that looked to me like a key source of happiness and fulfillment.
Understanding the problem
After my midwifery career, I flirted briefly with a career in advertising, but by the early '90s I was back in school—in a Ph.D. program in U.C. Berkeley's Graduate School of Education—where I found myself observing the same pattern I'd observed as a midwife. Both the research and my own lab experience exposed the early loss of students' natural love of learning. My concern was only increased by the newly emerging trend toward high stakes multiple choice testing, which my colleagues and I saw as a further threat to children's natural drive to learn.
Most of the people I've spoken to about this problem have agreed that it's a shame, but few have seen it as a problem that can be solved, and many have seen it as an inevitable consequence of either mass schooling or simple maturation. But I knew it was not inevitable. Children educated in a range of alternative environments did not appear to lose their drive to learn. Additionally, above average students in conventional schools appeared to be more likely to retain their love of learning.
I set out to find out why—and ended up on a long journey toward a solution.
How learning works
First, I needed to understand how learning works. At Berkeley, I studied a wide variety of learning theories in several disciplines, including developmental theories, behavioral theories, and brain-based theories. I collected a large database of longitudinal interviews and submitted them to in-depth analysis, looked closely at the relation between testing and learning, and studied psychological measurement, all in the interest of finding a way to support children's growth while reinforcing their love of learning.
My dissertation—which won awards from both U.C. Berkeley and the American Psychological Association—focused on the development of people's conceptions of learning from age 5 through 85, and how this kind of knowledge could be used to measure and support learning. In 1998, I received $500,000 from the Spencer Foundation to further develop the methods designed for this research. Some of my areas of expertise are human learning and development, psychometrics, metacognition, moral education, and research methods.
In the simplest possible terms, what I learned in 5 years of graduate school is that the human brain is designed to drive learning, and that preserving that natural drive requires 5 ingredients:
a safe environment that is rich in learning opportunities and healthy human interaction,
a teacher who understands each child's interests and level of tolerance for failure,
a mechanism for determining "what comes next"—what is just challenging enough to allow for success most of the time (but not all of the time),
instant actionable feedback, and
the opportunity to integrate new knowledge or skills into each learner's existing knowledge network well enough to make it useable before pushing instruction to the next level. (We call this building a "robust knowledge network"—the essential foundation for future learning.)*
Identifying the solution
Once we understood what learning should look like, we needed to decide where to intervene. The answer, when it came, was a complete surprise. Understanding what comes next—something that can only be learned by measuring what a student understands now—was an integral part of the recipe for learning. This meant that testing—which we originally saw as an obstacle to robust learning—was actually the solution—but only if we could build tests that would free students to learn the way their brains are designed to learn. These tests would have to help teachers determine "what comes next" (ingredient 3) and provide instant actionable feedback (ingredient 4), while rewarding them for helping students build robust knowledge networks (ingredient 5).
Unfortunately, conventional standardized tests were focused on "correctness" rather than robust learning, and none of them were based on the study of how targeted concepts and skills develop over time. Moreover, they were designed not to support learning, but rather to make decisions about advancement or placement, based on how many correct answers students were able to provide relative to other students. Because this form of testing did not meet the requirements of our learning recipe, we'd have to start from scratch.
Developing the solution
We knew that our solution—reinventing educational testing to serve robust learning—would require many years of research. In fact, we would be committing to possible decades of effort without a guaranteed result. It was the vision of a future educational system in which all children retained their inborn drive for learning that ultimately compelled us to move forward.
To reinvent educational testing, we needed to:
make a deep study of precisely how children build particular knowledge and skills over time in a wide range of subject areas (so these tests could accurately identify "what comes next");
make tests that determine how deeply students understand what they have learned—how well they can use it to address real-world issues or problems (requires that students show how they are thinking, not just what they know—which means written responses with explanations); and
produce formative feedback and resources designed to foster "robust learning" (build robust knowledge networks).
Here's what we had to invent:
A learning ruler (building on Commons and Fischer);
A method for studying how students learn tested concepts and skills (refining the methods developed for my dissertation);
A human scoring system for determining the level of understanding exhibited in students' written explanations (building upon Commons' and Fischer's methods, refining them until measurements were precise enough for use in educational contexts); and
An electronic scoring system, so feedback and resources could be delivered in real time.
It took over 20 years (1996–2016), but we did it! And while we were doing it, we conducted research. In fact, our assessments have been used in dozens of research projects, including a 25 million dollar study of literacy conducted at Harvard, and numerous Ph.D. dissertations—with more on the way.
What we've learned
We've learned many things from this research. Here are some that took us by surprise:
Students in schools that focus on building deep understanding graduate seniors who are up to 5 years ahead (on our learning ruler) of students in schools that focus on correctness (2.5 to 3 years after taking socioeconomic status into account).
Students in schools that foster robust learning develop faster and continue to develop longer (into adulthood) than students in schools that focus on correctness.
On average, students in schools that foster robust learning produce more coherent and persuasive arguments than students in schools that focus on correctness.
On average, students in our inner-city schools, which are the schools most focused on correctness, stop developing (on our learning ruler) in grade 10.
The average student who graduates from a school that strongly focuses on correctness is likely, in adulthood, to (1) be unable to grasp the complexity and ambiguity of many common situations and problems, (2) lack the mental agility to adapt to changes in society and the workplace, and (3) dislike learning.
From our perspective, these results point to an educational crisis that can best be addressed by allowing students to learn as their brains were designed to learn. Practically speaking, this means providing learners, parents, teachers, and schools with metrics that reward and support teaching that fosters robust learning.
Where we are today
Lectica has created the only metrics that meet all of these requirements. Our mission is to foster greater individual happiness and fulfillment while preparing students to meet 21st century challenges. We do this by creating and delivering learning tools that encourage students to learn the way their brains were designed to learn. And we ensure that students who need our learning tools the most get them first by providing free subscriptions to individual teachers everywhere.
To realize our mission, we organized as a nonprofit. We knew this choice would slow our progress (relative to organizing as a for-profit and welcoming investors), but it was the only way to guarantee that our true mission would not be derailed by other interests.
Thus far, we've funded ourselves with work in the for-profit sector and income from grants. Our background research is rich, our methods are well-established, and our technology works even better than we thought it would. Last fall, we completed a demonstration of our electronic scoring system, CLAS, a novel technology that learns from every single assessment taken in our system.
The groundwork has been laid, and we're ready to scale. All we need is the platform that will deliver the assessments (called DiscoTests), several of which are already in production.
After 20 years of high stakes testing, students and teachers need our solution more than ever. We feel compelled to scale as quickly as possible, so we can begin the process of reinvigorating today's students' natural love of learning and ensure that the next generation of students never loses theirs. Lectica's story isn't finished. Instead, we find ourselves on the cusp of a new beginning!
A final note: There are many benefits associated with our approach to assessment that were not mentioned here. For example, because the assessment scores are all calibrated to the same learning ruler, students, teachers, and parents can easily track student growth. Even better, our assessments are designed to be taken frequently and to be embedded in low-stakes contexts. For grading purposes, teachers are encouraged to focus on growth over time rather than specific test scores. This way of using assessments pretty much eliminates concerns about cheating. And finally, the electronic scoring system we developed is backed by the world's first "taxonomy of learning," which also serves many other educational and research functions. It's already spawned a developmentally sensitive spell-checker! One day, this taxonomy of learning will be robust enough to empower teachers to create their own formative assessments on the fly.
We've been hearing quite a bit about the "proficiency vs. growth" debate since Betsy DeVos (Trump's candidate for Education Secretary) was asked to weigh in last week. This debate involves a disagreement about how high stakes tests should be used to evaluate educational programs. Advocates for proficiency want to reward schools when their students score higher on state tests. Advocates for growth want to reward schools when their students grow more on state tests. Readers who know about Lectica's work can guess where we'd land in this debate—we're outspokenly growth-minded.
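The difference between the two reward schemes is easy to make concrete with a toy calculation. The schools and scores below are invented purely for illustration:

```python
# Toy illustration of the proficiency vs. growth debate.
# All schools and scores below are fictional.

schools = {
    "School A": {"fall": [62, 55, 70], "spring": [88, 80, 92]},  # starts low, gains a lot
    "School B": {"fall": [85, 90, 88], "spring": [87, 91, 89]},  # starts high, gains little
}

def proficiency(scores):
    """Average spring score: rewards ending up high, however you got there."""
    return sum(scores["spring"]) / len(scores["spring"])

def growth(scores):
    """Average fall-to-spring gain: rewards improvement, wherever you started."""
    gains = [s - f for f, s in zip(scores["fall"], scores["spring"])]
    return sum(gains) / len(gains)

for name, data in schools.items():
    print(f"{name}: proficiency={proficiency(data):.1f}, growth={growth(data):+.1f}")
```

Under a proficiency rule, School B "wins"; under a growth rule, School A does. Same data, opposite verdicts, which is why the choice of metric matters so much.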
For us, however, the proficiency vs. growth debate is only a tiny piece of a broader issue about what counts as learning. Here's a sketch of the situation as we see it:
Getting a higher score on a state test means that you can get more correct answers on increasingly difficult questions, or that you can more accurately apply writing conventions or decode texts. But these aren't the things we really want to measure. They're "proxies"—approximations of our real learning objectives. Test developers measure proxies because they don't know how to measure what we really want to know.
What we really want to know is how well we're preparing students with the skills and knowledge they'll need to successfully navigate life and work.
Scores on conventional tests predict how well students are likely to perform on future conventional tests. But scores on these tests have not been shown to be good predictors of success in life.*
In light of this glaring problem with conventional tests, the debate between proficiency and growth is a bit of a red herring. What we really need to be asking ourselves is a far more fundamental question:
What knowledge and skills will our children need to navigate the world of tomorrow, and how can we best nurture their development?
That's the question that frames our work here at Lectica.
*For information about the many problems with conventional tests, see FairTest.
Ten years ago, Kirschner, Sweller, & Clark published an article entitled, Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching.
In this article, Kirschner and his colleagues contrast outcomes for what they call "guidance instruction" (lecture and demonstration) with those from constructivism-based instruction. They conclude that constructivist approaches produce inferior outcomes.
The article suffers from at least three serious flaws.
First, the authors, in making their distinction between guided instruction and constructivist approaches, have created a caricature of constructivist approaches. Very few experienced practitioners of constructivist, discovery, problem-based, experiential, or inquiry-based teaching would characterize their approach as minimally guided. "Differently guided" would be a more appropriate term. Moreover, most educators who use constructivist approaches include lecture and demonstration where these are appropriate.
Second, the research reviewed by the authors was fundamentally flawed. For the most part, the metrics employed to evaluate different styles of instruction were not reasonable measures of the kind of learning constructivist instruction aims to support—deep understanding (the ability to apply knowledge effectively in real-world contexts). They were measures of memory or attitude. Back in 2010, Stein, Fisher, and I argued that metrics can't produce valid results if they don't actually measure what we care about (Redesigning testing: Operationalizing the new science of learning.) Why isn't this a no-brainer?
And finally, the longitudinal studies Kirschner and his colleagues reviewed had short time-spans. None of them examined the long-term impacts of different forms of instruction on deep understanding or long-term development. This is a big problem for learning research—one that is often acknowledged, but rarely addressed.
Since Kirschner's article was published in 2006, we've had an opportunity to examine the difference between schools that provide different kinds of instruction, using assessments that measure the depth and coherence of students' understanding. We've documented a 3 to 5 year advantage, by grade 12, for students who attend schools that emphasize constructivist methods vs. those that use more "guidance instruction".
In this post, I'll be describing and comparing three basic forms of assessment—surveys, tests of factual and procedural knowledge, and performative tests.
Surveys—measures of perception, preference, or opinion
What is a survey? A survey (a.k.a. inventory) is any assessment that asks the test-taker to choose from a set of options, such as "strongly agree" or "strongly disagree", based on opinion, preference, or perception. Surveys can be used by organizations in several ways. For example, opinion surveys can help maintain employee satisfaction by providing a "safe" way to express dissatisfaction before workplace problems have a chance to escalate.
Just about everyone who's worked for a large organization has completed a personality inventory as part of a team-building exercise. The results stimulate lots of water cooler discussions about which "type" or "color" employees are, but their impact on employee performance is unclear. (Fair warning: I'm notorious for my discomfort with typologies!) Some personality inventories are even used in high stakes hiring and promotion decisions, a practice that continues despite evidence that they are very poor predictors of employee success.
Most survey developers don't claim that their assessments measure competence, but many do. The item on the left was used in a survey with the words "management skills" in its title.
Claims that surveys measure competence are most common when "malleable traits"—traits that are subject to change, learning, or growth—are targeted. One example of a malleable trait is "EQ" or "emotional intelligence". EQ is viewed as a skill that can be developed, and there are several surveys that purport to measure its development. What they actually measure is attitude.
Another example of surveys masquerading as assessments of skill is the measurement of "transformational learning". Transformational learning is defined as a learning experience that fundamentally changes the way a person understands something, yet the only way it appears to be measured is with surveys. Transformational learning surveys measure people's perceptions of their learning experience, not how much they are actually changed by it.
The only survey-type assessments that can be said to measure something like skill are those—such as 360s—that ask respondents about their perceptions of another person's behavior. Although 360s inadvertently measure other things, like how much a person is liked or whether or not a respondent agrees with that person, they may also document evidence of behavior change. If behavior change is what interests you, a 360 may be appropriate in some cases, but keep in mind that a 360 is also likely to register changes in a respondent's attitude that are unrelated to the target's behavior.
360-type assessments may, to some extent, serve as tests of competence, because behavior change may be an indication that someone has learned new skills. When an assessment measures something that might be an indicator of something else, it is said to measure a proxy. A good 360 may measure a proxy (perceptions of behavior) for a skill (competence).
There are literally hundreds of research articles that document the limitations of surveys, but I'll mention only one more of them here: All of the survey types I've discussed are vulnerable to "gaming"—smart people can easily figure out what the most desirable answers are.
Surveys are extremely popular today because, relative to assessments of skill, they are inexpensive to develop and cost almost nothing to administer. Lectica gives away several high quality surveys for free because they are so inexpensive, yet organizations spend millions of dollars every year on surveys, many of which are falsely marketed as assessments of skill or competence.
Tests of factual and procedural knowledge
A test of competence is any test that asks the test-taker to demonstrate a skill. Tests of factual and procedural knowledge can legitimately be thought of as tests of competence.
The classic multiple choice test examines factual knowledge, procedural knowledge, and basic comprehension. If you want to know if someone knows the rules, which formulas to apply, the steps in a process, or the vocabulary of a field, a multiple choice test may meet your needs. Often, the developers of multiple choice tests claim that their assessments measure understanding, reasoning, or critical thinking. This is because some multiple choice tests measure skills that are assumed to be proxies for skills like understanding, reasoning, and critical thinking. They are not direct tests of these skills.
Multiple choice tests are widely used because there is a large industry devoted to making them, but they are increasingly unpopular because of their (mis)use as high stakes assessments. They are often perceived as threatening and unfair because they are frequently used to rank or select people, and they are not helpful to the individual learner. Moreover, their relevance is often questioned because they don't directly measure what we really care about—the ability to apply knowledge and skills in real-life contexts.
Tests that ask people to directly demonstrate their skills in (1) the real world, (2) real-world simulations, or (3) as they are applied to real-world scenarios are called performative tests. These tests usually do not have "right" answers. Instead, they employ objective criteria to evaluate performances for the level of skill demonstrated, and often play a formative role by providing feedback designed to improve performance or understanding. This is the kind of assessment you want if what you care about is deep understanding, reasoning skills, or performance in real-world contexts.
Performative tests are the most difficult tests to make, but they are the gold standard if what you want to know is the level of competence a person is likely to demonstrate in real-world conditions—and if you're interested in supporting development. Standardized performative tests are not yet widely used, because the methods and technology required to develop them are relatively new, and there is not yet a large industry devoted to making them. But they are increasingly popular because they support learning.
Unfortunately, performative tests may initially be perceived as threatening because people's attitudes toward tests of knowledge and skill have been shaped by their exposure to high stakes multiple choice tests. The idea of testing for learning is taking hold, but changing the way people think about something as ubiquitous as testing is an ongoing challenge.
Lectical Assessments are performative tests—tests for learning. They are designed to support robust learning—the kind of learning that optimizes the growth of essential real-world skills. We're the leader of the pack when it comes to the sophistication of our methods and technology, our evidence base, and the sheer number of assessments we've developed.
Morgeson, F. P., et al. (2007). Are we getting fooled again? Coming to terms with limitations in the use of personality tests for personnel selection. Personnel Psychology, 60, 1029-1033.
We incorporate feedback loops called virtuous cycles in everything we do. And I mean everything. Our governance structure is fundamentally iterative. (We're a Sociocracy.) Our project management approach is iterative. (We use Scrum.) We develop ideas iteratively. (We use Design Thinking.) We build our learning tools iteratively. (We use developmental maieutics.) And our learning model is iterative. (We use the virtuous cycle of learning.) One important reason for using all of these iterative processes is that we want every activity in our organization to reward learning. Conveniently, all of the virtuous cycles we iterate through do double duty as virtuous cycles of learning.
All of this virtuous cycling has an interesting (and unprecedented) side effect. The score you receive on one of our assessments is subject to change. Yes, because we learn from every single assessment taken in our system, what we learn could cause your score on any assessment you take here to change. Now, it's unlikely to change very much, probably not enough to affect the feedback you receive, but the fact that scores change from time to time can really shake people up. Some people might even think we've lost the plot!
But there is method in our madness. Allowing your score to fluctuate a bit as our knowledge base grows is our way of reminding everyone that there's uncertainty in any test score, and ourselves that there's always more to learn about how learning works.
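To see why a stored score can drift, it helps to think of a score not as a fixed number but as the output of a scoring function whose calibration keeps improving. The sketch below is purely illustrative: the feature names, weights, and the simple linear form are invented for this example, not a description of Lectica's actual CLAS model.

```python
# Illustrative sketch (not Lectica's actual model): a score is a function of a
# calibration that keeps learning, so rescoring a saved response under a newer
# calibration can shift its score slightly.

def score(response_features, calibration):
    """Score a response as a weighted sum of its features under a calibration."""
    return sum(calibration[f] * v for f, v in response_features.items())

response = {"abstraction": 3.0, "coherence": 2.0}  # invented features

calibration_v1 = {"abstraction": 1.00, "coherence": 1.00}
calibration_v2 = {"abstraction": 1.02, "coherence": 0.99}  # after learning from new data

old = score(response, calibration_v1)
new = score(response, calibration_v2)
print(round(old, 2), round(new, 2), round(new - old, 2))  # → 5.0 5.04 0.04
```

The shift is small (too small here to change any feedback), but it is real, and it happens without the test-taker doing anything at all: only the system's knowledge changed.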