For some time now, people have been asking us how they can learn at least some of what we teach in our certification courses—but without the homework! Well, we’ve taken the plunge, with two new self-guided courses.
All profits from sales support Lectica’s mission to deliver the world’s best assessments free of charge to K-12 teachers everywhere!
In LAP-1 Light, we’ve brought together the lectures and much of the course material offered in the certification version of the course—Lectical Assessments in Practice for Coaches. You’ll take a deep dive into our learning model and learn how two of our most popular adult assessments—the LDMA (focused on leadership decision making) and the LSUA (focused on leaders’ understanding of themselves in workplace relationships)—are used to support leader development.
This course is perfect for coaches or consultants who are thinking about certifying down the road.
In LAP-2 Light, we’re offering all of the lectures and much of the course material from LAP-2—Lectical Assessments in Practice for Recruitment Professionals. You’ll learn about Lectica’s Human Capital Value Chain, conventional recruitment practices, how to evaluate recruitment assessments, and all about Lectica’s recruitment products—including Lectica First (for front-line to mid-level recruitment) and Lectica Suite (for senior recruitment).
This course is perfect for recruitment professionals of all kinds, or for anyone who is toying with the idea of becoming accredited in the use of our recruitment tools.
How to recruit the brain’s natural motivational cycle—the power of fit-to-role.
People learn and work better when the challenges they face in their roles are just right—when there is good fit-to-role. Improving fit-to-role requires achieving an optimal balance between an individual’s level of skill and role requirements. When employers get this balance right, they increase engagement, happiness (satisfaction), quality of communication, productivity, and even cultural health.
In the workplace, the challenges we’re expected to face should be just big enough to allow for success most of the time, but not so big that frequent failure is inevitable. My colleagues and I call this balance-point the Goldilocks zone, because it’s where the level of challenge is just right. Identifying the Goldilocks zone is important for three reasons:
First, and most obviously, it’s not good for business if people make too many mistakes.
Second, if the distance between employees’ levels of understanding and the difficulty of the challenges they face is too great, employees are less likely to understand and learn from their mistakes. This kind of gap can lead to a vicious cycle, in which, instead of improving or staying the same, performance gradually deteriorates.
Third, when a work challenge is just right we’re more likely to enjoy ourselves—and feel motivated to work even harder. This is because challenges in the Goldilocks zone, allow us to succeed just often enough to stimulate our brains to release pleasure hormones called opioids. Opioids give us a sense of satisfaction and pleasure. And they have a second effect. They also trigger the release of dopamine—the striving hormone—which motivates us to reach for the next challenge (so we can experience the satisfaction of success once again).
The dopamine-opioid cycle will repeat indefinitely in a virtuous cycle, but only when enough of our learning challenges are in the zone—not too easy and not too hard. As long as the dopamine-opioid cycle keeps cycling, we feel engaged. Engaged people are happy people—they tend to feel satisfied, competent, and motivated. 
People are also happier when they feel they can communicate effectively and build understanding with those around them. When organizations get fit-to-role right for every member of a team, they’re also building a team with members who are more likely to understand one another. This is because the complexity level of role requirements for different team members are likely to be very similar. So, getting fit to role right for one team member means building a team in which members are performing within a complexity range that makes it relatively—but not too—easy for members to understand one another. Team members are happiest when they can be confident that—most of the time and with reasonable effort—they will be able to achieve a shared understanding with other members.
A team representing a diversity of perspectives and skills, composed of individuals performing within a complexity range of 10–20 points on the Lectical Scale is likely to function optimally.
Getting fit-to-role right, also ensures that line managers are slightly more complex thinkers than their direct reports. People tend to prefer leaders they can look up to, and most of us intuitively look up to people who think a little more complexly than we do.  When it comes to line managers, If we’re as skilled as they are, we tend to wonder why they’re leading us. If we’re more skilled than they are, we are likely to feel frustrated. And if they’re way more skilled than we are, we may not understand them fully. In other words, we’re happiest when our line managers challenge us—but not too much. (Sound familiar?)
Most people work better with line managers who perform 15–25 points higher on the Lectical Scale than they do.
Unsurprisingly, all this engagement and happiness has an impact on productivity. Individuals work more productively when they’re happily engaged. And teams work more productively when their members communicate well with one another.
The moral of the story
The moral of this story is that employee happiness and organizational effectiveness are driven by the same thing—fit-to-role. We don’t have to compromise one to achieve the other. Quite the contrary. We can’t achieve either without achieving fit-to-role.
To sum up, when we get fit to role right—in other words, ensure that every employee is in the zone—we support individual engagement & happiness, quality communication in teams, and leadership effectiveness. Together, these outcomes contribute to productivity and cultural health.
Getting fit-to-role right requires top-notch recruitment and people development practices, starting with the ability to measure the complexity of (1) role requirements and (2) people skills.
When my colleagues and I think about the future of recruitment and people development, we envision healthy, effective organizations characterized by engaged, happy, productive, and constantly developing employees & teams. We help organizations achieve this vision by…
reducing the cost of recruitment so that best practices can be employed at every level in an organization;
improving predictions of fit-to- role;
broadening the definition of fit-to-role to encompasses the role, the team, and the position of a role in the organizational hierarchy; and
promoting the seamless integration of recruitment with employee development strategy and practice.
 Csikszentmihalyi, M., Flow, the psychology of happiness. (2008) Harper-Collins.
 Oishi, S., Koo, M., & Akimoto, S. (2015) Culture, interpersonal perceptions, and happiness in social interactions, Pers Soc Psychol Bull, 34, 307–320.
 Oswald, A. J., Proto, E., & Sgroi, D. (2015). Happiness and productivity. Journal of labor economics, 33, 789-822.
Fleisch Kincaid and other reading level metrics are sometimes employed to compare the arguments made by politicians in their speeches, interviews, and writings. What are these metrics and what do they actually tell us about these verbal performances?
Fleisch Kincaid examines sentence, word length, and syllable number. Texts are considered “harder” when they have longer sentences and use words with more letters, and “easier” when they have shorter sentences and use words with fewer letters. For decades, Fleisch Kincaid and other reading level metrics have been used in word processors. When you are advised by a grammar checker that the reading level of your article is too high, it’s likely that this warning is based on word and sentence length.
Other reading level indicators, like Lexiles, use the commonness of words as an indicator. Texts are considered to be easier when the words they contain are more common, and more difficult when the words they contain are less common.
Because reading-level metrics are embedded in most grammar checkers, writers are continuously being encouraged to write shorter sentences with fewer, more common words. Writers for news media, advertisers, and politicians, all of whom care deeply about market share, work hard to create texts that meet specific “grade level” requirements. And if we are to judge by analyses of recent political speeches, this has considerably “dumbed down” political messages.
Weaknesses of reading level indicators
Reading level indicators look only at easy-to-measure things like length and frequency. But length and frequency are proxies for what they purport to measure—how easy it is to understand the meaning intended by the author.
Let’s start with word length. Words of the same length or number of syllables can have meanings that are more or less difficult to understand. The word, information has 4 syllables and 12 letters. The word, validity has 4 syllables and 8 letters. Which concept, information or validity, do you think is easier to understand? (Hint, one concept can’t be understood without a pretty rich understanding of the other.)
How about sentence length? These two sentences express the same meaning. “He was on fire.” “He was so angry that he felt as hot as a fire inside.” In this case, the short sentence is more difficult because it requires the reader to understand that it should be read within a context presented in an earlier sentence—”She really knew how to push his buttons.”
Finally, what about commonness? Well, there are many words that are less common but no more difficult to understand than other words. Take “giant” and “enormous.” The word, enormous doesn’t necessarily add meaning, it’s just used less often. It’s not harder, just less popular. And some relatively common words are more difficult to understand than less common words. For example, evolution is a common word with a complex meaning that’s quite difficult to understand, and onerous is an uncommon word that’s relatively easy to understand.
I’m not arguing that reducing sentence and word length and using more common words don’t make prose easier to understand, but metrics that use these proxies don’t actually measure understandability—or at least they don’t do it very well.
How reading level indicators relate to complexity level
When my colleagues and I analyze the complexity level of a text, we’re asking ourselves, “At what level does this person understand these concepts?” We’re looking for meaning, not word length or popularity. Level of complexity directly represents level of understanding.
Reading level indicators do correlate with complexity level. Correlations are generally within the range of .40 to .60, depending on the sample and reading level indicator. These are strong enough correlations to suggest that 16% to 36% of what reading-level indicators measure is the same thing we measure. In other words, they are weak measures of meaning. They are stronger measures of factors that impact readability, but are not related directly to meaning—sentence and word length and/or commonness.
Here’s an example of how all of this plays out in the real world: The New York Times is said to have a grade 7 Fleisch Kincaid reading level, on average. But complexity analyses of their articles yield scores of 1100-1145. In other words, these articles express meanings that we don’t see in assessment responses until college and beyond. This would explain why the New York Times audience tends to be college educated.
We would say that by reducing sentence and word length, New York Times writers avoid making complex ideas harder to understand.
Reading level indicators are flawed measures of understanding. They are also dinosaurs. When these tools were developed, we couldn’t do any better. But advances in technology, research methods, and the science of learning have taken us beyond proxies for understanding to direct measures of understanding. The next challenge is figuring out how to ensure that these new tools are used responsibly—for the good of all.
Shortly after the President passed the Montreal Cognitive Assessment, a reader emailed with two questions:
Does this mean that the President has the cognitive capacity required of a national leader?
How does a score on this test relate to the complexity level scores you have been describing in recent posts?
A high score on the Montreal Cognitive Assessment dos not mean that the President has the cognitive capacity required of a national leader. This test result simply means there is a high probability that the President is not suffering from mild cognitive impairment. (The test has been shown to detect existing cognitive impairment 88% of the time .) In order to determine if the President has the mental capacity to understand the complex issues he faces as a National Leader, we need to know how complexly he thinks about those issues.
The answer to the second question is that there is little relation between scores on the Montreal Cognitive Assessment and the complexity level of a person’s thinking. A test like the Montreal Cognitive Assessment does not require the kind of thinking a President needs to understand highly complex issues like climate change or the economy. Teenagers can easily pass this test.
The difference between 1053 and 1137 generally represents a decade or more of sustained learning. (If you’re a new reader and don’t yet know what a complexity level is, check out the National Leaders Series introductory article.)
 JAMA Intern Med. 2015 Sep;175(9):1450-8. doi: 10.1001/jamainternmed.2015.2152. Cognitive Tests to Detect Dementia: A Systematic Review and Meta-analysis. Tsoi KK, Chan JY, Hirai HW, Wong SY, Kwok TC.
There’s a battle out there no one’s tweeting about. It involves a tension between statistical significance and practical significance. If you make decisions that involve evaluating evidence—in other words, if you are human—understanding the distinction between these two types of significance will significantly improve your decisions (both practically and statistically).
Statistical significance (a.k.a. “p”) is a calculation made to determine how confident we can be that a relationship between two factors (variables) is real. The lower a p value, the more confident we can be. Most of the time, we want p to be less than .05.
Don’t be misled! A low p value tells us nothing about the size of a relationship between two variables. When someone says that statistical significance is high, all this means is that we can be more confident that the relationship is real.
Once we know we can be confident that a relationship between two variables is real, we should check to see if the research has been replicated. That’s because we can’t be sure a statistically significant relationship found in a single study is really real. After we’ve determined that a relationship is statistically significant and replicable, it’s time to consider practical significance. Practical significance has to do with the size of the relationship.
To figure out how practically significant a relationship is, we need to know how big it is. The size of a relationship, or effect size, is evaluated independently of p. For a plain English discussion of effect size, check out this article, Statistics for all: prediction.
The greater the size of a relationship between two variables, the more likely the relationship is to be important — but that’s not enough. To have real importance, a relationship must also matter. And it is the decision-maker who decides what matters.
Let’s look at one of my favorite examples. The results of high stakes tests like the SAT and GRE — college entrance exams made by ETS — have been shown to predict college success. Effect sizes tend to be small, but the effects are statistically significant — we can have confidence that they are real. And evidence for these effects have come from numerous studies, so we know they are really real.
If you’re the president of a college, there is little doubt that these test scores have practical significance. Improving prediction of student success, even a little, can have a big impact on the bottom line.
If you’re an employer, you’re more likely to care about how well a student did in college than how they did prior to college, so SAT and GRE scores are likely to be less important to you than college success.
If you’re a student, the size of the effect isn’t important at all. You don’t make the decision about whether or not the school is going to use the SAT or GRE to filter students. Whether or not these assessments are used is out of your control. What’s important to you is how a given college is likely to benefit you.
If you’re me, the size of the effect isn’t very important either. My perspective is that of someone who wants to see major changes in the educational system. I don’t think we’re doing our students any favors by focusing on the kind of learning that can be measured by tests like the GRE and SAT. I think our entire educational system leans toward the wrong goal—transmitting more and more “correct” information. I think we need to ask if what students are learning in school is preparing them for life.
Another thing to consider when evaluating practical significance is whether or not a relationship between two variables tells us only part of a more complex story. For example, the relationship between ethnicity and the rate of developmental growth (what my colleagues and I specialize in measuring) is highly statistically significant (real) and fairly strong (moderate effect size). But, this relationship completely disappears once socioeconomic status (wealth) is taken into account. The first relationship is misleading (spurious). The real culprit is poverty. It’s a social problem, not an ethnic problem.
Most discussions of practical significance stop with effect size. From a statistical perspective, this makes sense. Statistics can’t be used to determine which outcomes matter. People have to do that part, but statistics, when good ones are available, should come first. Here’s my recipe:
My organization, Lectica, Inc., is a 501(c)3 nonprofit corporation. Part of our mission is to share what we learn with the world. One of the things we’ve learned is that many assessment buyers don’t seem to know enough about statistics to make the best choices. The Statistics for all series is designed to provide assessment buyers with the knowledge they need most to become better assessment shoppers.
Why you might want to reconsider using 360s and EQ assessments to predict recruitment success
Measurements are often used to make predictions. For example, they can help predict how tall a 4-year-old is likely to be in adulthood, which students are likely to do better in an academic program, or which candidates are most likely to succeed in a particular job.
Some of the attributes we measure are strong predictors, others are weaker. For example, a child’s height at age 4 is a pretty strong predictor of adult height. Parental height is a weaker predictor. The complexity of a person’s workplace decision making, on its own, is a moderate predictor of success in the workplace. But the relation between the complexly of their workplace decision making and the complexity of their role is a strong predictor.
How do we determine the strength or a predictor? In statistics, the strength of predictions is represented by an effect size. Most effect size indicators are expressed as decimals and range from .00 –1.00, with 1.00 representing 100% accuracy of prediction. The effect size indicator you’ll see most often is r-square. If you’ve ever been forced to take a statistics course—;)—you may remember that r represents the strength of a correlation. Before I explain r-square, let’s look at some correlation data.
The four figures below represent 4 different correlations, from weakest (.30) to strongest (.90). Let’s say the vertical axis (40 –140) represents the level of success in college, and the horizontal axis (50 –150) represents scores on one of 4 college entrance exams. The dots represent students. If you were trying to predict success in college, you would be wise to choose the college entrance exam that delivered an r of .90.
Why is an r of .90 preferable? Well, take a look at the next set of figures. I’ve drawn lines through the clouds of dots (students) to show regression lines. These lines represent the prediction we would make about how successful a student will be, given a particular score. It’s clear that in the case of the first figure (r =.30), this prediction is likely to be pretty inaccurate. Many students perform better or worse than predicted by the regression line. But as the correlations increase in size, prediction improves. In the case of the fourth figure (r =.90), the prediction is most accurate.
What does a .90 correlation mean in practical terms? That’s where r-square comes in. If we multiply .90 by .90 (calculate the square), we get an r-square of .81. Statisticians would say that the predictor (test score), explains 81% of the variance in college success. The 19% of the variance that’s not explained (1.00 -.81 =.19) represents the percent of the variance that is due to error (unexplained variance). The square root of 19% is the amount of error (.44).
Correlations of .90 are very rare in the social sciences—but even correlations this strong are associated with a significant amount of error. It’s important to keep error in mind when we use tests to make big decisions—like who gets hired or who gets to go to college. When we use tests to make decisions like these, the business or school is likely to benefit—slightly better prediction can result in much better returns. But there are always rejected individuals who would have performed well, and there are always accepted individuals who will perform badly.
Let’s get realistic. As I mentioned earlier, correlations of .90 are very rare. In recruitment contexts, the most predictive assessments (shown above) correlate with hire success in the range of .50 –.54, predicting from 25% – 29% of the variance in hire success. That leaves a whopping 71% – 75% of the variance unexplained, which is why the best hiring processes not only use the most predictive assessments, but also consider multiple predictive criteria.
On the other end of the spectrum, there are several common forms of assessment that explain less than 9% of the variance in recruitment success. Their correlations with recruitment success are lower than .30. Yet some of these, like 360s, reference checks, and EQ, are wildly popular. In the context of hiring, the size of the variance explained by error in these cases (more than 91%) means there is a very big risk of being unfair to a large percentage of candidates. (I’m pretty certain assessment buyers aren’t intentionally being unfair. They probably just don’t know about effect size.)
If you’ve read my earlier article about replication, you know that the power-posing research could not be replicated. You also might be interested to learn that the correlations reported in the original research were also lower than .30. If power-posing had turned out to be a proven predictor of presentation quality, the question I’d be asking myself is, “How much effort am I willing to put into power-posing when the variance explained is lower than 9%?”
If we were talking about something other than power-posing, like reducing even a small risk that my child would die of a contagious disease, I probably wouldn’t hesitate to make a big effort. But I’m not so sure about power-posing before a presentation. Practicing my presentation or getting feedback might be a better use of my time.
Summing up (for now)
A basic understanding of prediction is worth cultivating. And it’s pretty simple. You don’t even have to do any fancy calculations. Most importantly, it can save you time and tons of wasted effort by giving you a quick way to estimate the likelihood that an activity is worth doing (or product is worth having). Heck, it can even increase fairness. What’s not to like?
My organization, Lectica, Inc., is a 501(c)3 nonprofit corporation. Part of our mission is to share what we learn with the world. One of the things we’ve learned is that many assessment buyers don’t seem to know enough about statistics to make the best choices. The Statistics for all series is designed to provide assessment buyers with the knowledge they need most to become better assessment shoppers.
(Why you should have been suspicious of power-posing from the start!)
I’ve got a free, low-tech life hack for you that will save significant time and money — and maybe even improve your health. All you need to do is one little thing. Before you let the latest research results change your behavior, check to see if the research has been replicated!
One of the hallmarks of modern science is the notion that one study of a new phenomenon—especially a single small study—proves nothing. Most of the time, the results of such studies can do little more than suggest possibilities. To arrive at proof, results have to be replicated—again and again, usually in a variety of contexts. This is important, especially in the social sciences, where phenomena are difficult to measure and the results of many new studies cannot be replicated.
Researchers used to be trained to avoid even implying that findings from a new study were proven facts. But when Amy Cuddy set out to share the results of her and her colleagues’ power-posing research, she didn’t simply imply that her results could be generalized. She unabashedly announced to an enthralled Ted Talk audience that she’d discovered a “Free, no-tech life hack…that could significantly change how your life unfolds.”
Thanks to this talk, many thousands—perhaps millions—of people-hours have been spent power-posing. But it’s not the power-posers whose lives have changed. Unfortunately, as it turns out, it’s Dr. Cuddy’s life that changed significantly—when other researchers were unable to replicate her results. In fact, because she had made such strong unwarranted claims, Dr. Cuddy became the focus of severe criticism.
Although she was singled out, Dr. Cuddy is far from alone. She’s got lots of company. Many fads have begun just like Power Posing did. Here’s how it goes: A single small study produces results that have “novelty appeal,” the Today Show picks up the story, and thousands jump on the bandwagon! Sometimes, as in the case of power-posing, the negative impact is no worse than a bit of wasted time. But in other cases, such as when our heath or pocketbooks are at stake, the impacts can be much greater.
“But it worked for me!” If you tried power-posing and believe it was responsible for your success in achieving an important goal, you may be right. The scientific method isn’t perfect — especially in the social sciences — and future studies with better designs may support your belief. However, I recommend caution in relying on personal experience. Humans have powerful built-in mental biases that lead us to conclude that positive outcomes are caused by something we did to induce them. This makes it very difficult for us to distinguish between coincidence and cause. And it’s one reason we need the scientific method, which is designed to help us reduce the impact of these biases.
Replication matters in assessment development, too
Over the last couple of decades, I’ve looked at the reliability & validity evidence for many assessments. The best assessment developers set a pretty high replication standard, conducting several validity & reliability studies for each assessment they offer. But many assessment providers—especially those serving businesses—are much more lax. In fact, many can point to only a single study of reliability and validity. To make matters worse, in some cases, that study has not been peer reviewed.
Be wary of assessments that aren’t backed by several studies of reliability and validity.
In the first post in this series, I promised to share a quick and dirty trick for determining how much confidence you can have in a test score. I will. But first, I want to show you a bit more about what estimating confidence means when it comes to educational and psychological tests.
Let’s start with a look at how test scores are usually reported. The figure below shows three scores, one at level 8, one at level 6, and one at level 4. Looking at this figure, most of us would be inclined to assume that these scores are what they seem to be—precise indicators of the level of a trait or skill.
But this is not the case. Test scores are fuzzy. They’re best understood as ranges ratherthan as points on a ruler. In other words, test scores are always surrounded by confidence intervals. A person’s true score is likely to fall somewhere in the range described by the confidence interval around a test score.
In order to figure out how fuzzy a test score actually is, you need one thing—an indicator of statistical reliability. Most of the time, this is something called Cronbach’s Alpha. All good test developers publish information about the statistical reliability of their measures, ideally in refereed academic journals with easy to find links on their web sites! If a test developer won’t provide you with information about Alpha (or its equivalent) for each score reported on a test, it’s best to move on.
The higher the reliability (usually Alpha) the smaller the confidence interval. And the smaller the confidence interval, the more confidence you can have in a test score.
The table below will help to clarify why it is important to know Alpha (or its equivalent). It shows the relationship between Alpha (which can range from 0 to 1.0) and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80, has 3 strata, whereas an assessment with a reliability of .94 has 5.
Strata have direct implications for the confidence we can have in a person’s score on a given assessment, because they tell us about the range within which a person’s true score would fall—its confidence interval—given the score awarded.
Imagine that you have just taken a test of emotional intelligence with a score range of 1 to 10 and a reliability of .95. The number of strata into which an assessment with a reliability of .95 can be divided is about 6, which means that each strata equals about 1.75 points on the 10 point scale (10 divided by 6). If your score on this test was 8, your true score would likely be somewhere between 7.13 and 8.88—your score’s confidence interval.
The figure below shows the true score ranges for three test takers, CB, RM, and PR. The fact that these ranges don’t overlap gives us confidence that the emotional intelligence of these test-takers is actually different**.
If these scores were closer together, their confidence intervals would overlap. And if that was the case—for example if you were comparing two individuals with scores of 8 and 8.5—it would not be correct to say the scores were different form one another. In fact, it would be incorrect for a hiring manager to consider the difference between a score of 8 and a score of 8.5 in making a choice between two job candidates.
By the way, tests with Alphas in the range of .94 or higher are considered suitable for high-stakes use (assuming that they meet other essential validity requirements). What you see in the figure below is about as good as it gets in educational and psychological assessment.
Most assessments used in organizations do not have Alphas that are anywhere near .95. Some of the better assessments have Alphas as high as .85. Let’s take a look at what an Alpha at this level does to confidence intervals.
If the test you have taken has a score range of 1–10 and an Alpha (reliability) of .85, the number of strata into which this assessment can be divided is about 3.4, which means that each strata equals about 2.9 (10 divided by 3.4) points on the 10 point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6.6 to 9.5*.
In the figure below, note that CB’s true score range now overlaps RM’s true score range and RM’s true score range overlaps PR’s true score range. This means we cannot say—with confidence—that CB’s score is different from RM’s score, or that RM’s score is different from PR’s score.
Assessments with Alphas in the .85 range are suitable for classroom use or low-stakes contexts. Yet, every day, schools and businesses use tests with reliabilities in the .85 range to make high stakes decisions—such as who will be selected for advancement or promotion. And this is often done in a way that would exclude RM (yellow circle) even though his confidence interval overlaps CB’s (teal circle) confidence interval.
Many tests used in organizations have Alphas in the .75 range. If the test you have taken has a score range of 1–10 and an Alpha of .75, the number of strata into which this assessment can be divided is about 2.2, which means that each strata equals about 4.5 points on the 10 point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6–10*.
As shown in the figure below, scores would now have to differ by at least 4.5 points in order for us to distinguish between two people. CB’s and PR’s scores are different, but RM’s score is uninterpretable.
Tests or sub-scales with alphas in the .75 range are considered suitable for research purposes. Yet, sad to say, schools and businesses now use tests with scales or sub-scales that have Alphas in or below the .75 range, treating these scores as if they provide useful information, when in most cases the scores—like RM’s—are uninterpretable.
If your current test providers are not reporting true score ranges (confidence intervals), ask for them. If they only provide Alphas (reliability statistics) you can use the table and figures in this article to calculate true score ranges for yourself. If you don’t want to do the math, no problem. You can use the figures above to get a feel for how precise a score is.
Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important. I’ll be sharing some tricks for looking at these forms of validity in future articles.
Special thanks to my Australian colleague, Aiden Thornton, for his editorial and research assistance.
This is the first in a series of articles on the complexity of national leaders’ thinking. These articles will report results from research conducted with CLAS, our newly validated electronic developmental scoring system. CLAS will be used to score these leaders’ responses to questions posed by prominent journalists.
In this first article, I’ll be providing some of the context for this project, including information about how my colleagues and I think about complexity and its role in leadership. I’ve embedded lots of links to additional material for readers who have questions about our 100+ year-old research tradition, Lectica’s (the nonprofit that owns me) assessments, and other research we’ve conducted with these assessments.
Context and research questions
Lectica creates diagnostic assessments for learningthat support the development of mental skills required for working with complexity. We make these learning tools for both adults and children. Our K-12 initiative—the DiscoTest Initiative—is dedicated to bringing these tools to individual K-12 teachers everywhere, free of charge. Its adult assessments are used by organizations in recruitment and training, and by colleges and universities in admissions and program evaluation.
All Lectical Assessments measure the complexity level (aka, level of vertical development) of people’s thinking in particular knowledge areas. A complexity level score on a Lectical Assessment tells us the highest level of complexity—in a problem, issue, or task—an individual is likely to be able to work with effectively.
On several occasions over the last 20 years, my colleagues and I have been asked to evaluate the complexity of national leaders’ reasoning skills. Our response has been, “We will, but only when we can score electronically—without the risk of human bias.” That time has come. Now that our electronic developmental scoring system, CLAS, has demonstrated a level of reliability and precision that is acceptable for this purpose, we’re ready to take a look.
Evaluating the complexity of national leaders’ thinking is a challenging task for several reasons. First, it’s virtually impossible to find examples of many of these leaders’ “best work.” Their speeches are generally written for them, and speech writers usually try to keep the complexity level of these speeches low, aiming for a reading level in the 7th to 9th grade range. (Reading level is not the same thing as complexity level, but like most tests of capability, it correlates moderately with complexity level.) Second, even when national leaders respond to unscripted questions from journalists, they work hard to use language that is accessible to a wide audience. And finally, it’s difficult to identify a level playing field—one in which all leaders have the same opportunity to demonstrate the complexity of their thinking.
Given these obstacles, there’s no point in attempting to evaluate the actual thinking capabilities of national leaders. In other words, we won’t be claiming that the scores awarded by CLAS represent the true complexity level of leaders’ thinking. Instead, we will address the following questions:
When asked by prominent journalists to explain their positions on complex issues, what is the average complexity level of national leaders’ responses?
How does the complexity level of national leaders’ responses relate to the complexity of the issues they discuss?
Thinking complexity and leader success
At this point, you may be wondering, “What is thinking complexity and why is it important?” A comprehensive response to this question isn’t possible in a short article like this one, but I can provide a basic description of complexity as we see it at Lectica, and provide some examples that highlight its importance.
All issues faced by leaders are associated with a certain amount of built-in complexity. For example:
The sheer number of factors/stakeholders that must be taken into account.
Short and long-term implications/repercussions. (Will a quick fix cause problems downstream, such as global unrest or catastrophic weather?)
The number and diversity of stakeholders/interest groups. (What is the best way to balance the needs of individuals, families, businesses, communities, states, nations, and the world?)
The length of time it will take to implement a decision. (Will it take months, years, decades? Longer projects are inherently more complex because of changes over time.)
Formal and informal rules/laws that place limits on the deliberative process. (For example, legislative and judicial processes are often designed to limit the decision making powers of presidents or prime ministers. This means that leaders must work across systems to develop decisions, which further increases the complexity of decision making.)
Over the course of childhood and adulthood, the complexity of our thinking develops through up to 13 skill levels (0–12). Each new level builds upon the previous level. The figure above shows four adult complexity “zones” — advanced linear thinking (second zone of level 10), early systems thinking (first zone of level 11), advanced systems thinking (second zone of level 11), early principles thinking (first zone of level 12). In advanced linear thinking, reasoning is often characterized as “black and white.” Individuals performing in this zone cope best with problems that have clear right or wrong answers. It is only once individuals enter early systems thinking, that we begin to work effectively with highly complex problems that do not have clear right or wrong answers.
Leadership at the national level requires exceptional skills for managing complexity, including the ability to deal with the most complex problems faced by humanity (Helbing, 2013). Needless to say, a national leader regularly faces issues at or above early principles thinking.
Complexity level and leadership—the evidence
In the workplace, the hiring managers who decide which individuals will be put in leadership roles are likely to choose leaders whose thinking complexity is a good match for their roles. Even if they have never heard the term complexity level, hiring managers generally understand, at least implicitly, that leaders who can work with the complexity inherent in the issues associated with their roles are likely to make better decisions than leaders whose thinking is less complex.
There is a strong relation between the complexity of leadership roles and the complexity level of leaders’ reasoning. In general, more complex thinkers fill more complex roles. The figure below shows how lower and senior level leaders’ complexity scores are distributed in Lectica’s database. Most senior leaders’ complexity scores are in or above advanced systems thinking, while those of lower level leaders are primarily in early systems thinking.
The strong relation between the complexity of leaders’ thinking and the complexity of their roles can also be seen in the recruitment literature. To be clear, complexity is not the only aspect of leadership decision making that affects leaders’ ability to deal effectively with complex issues. However, a large body of research, spanning over 50 years, suggests that the top predictors of workplace leader recruitment success are those that most strongly relate to thinking skills, including complexity level.
The figure below shows the predictive power of several forms of assessment employed in making hiring and promotion decisions. Assessments of mental ability have been shown to have the highest predictive power. In other words, assessments of thinking skills do a better job predicting which candidates will be successful in a given role than other forms of assessment.
The match between the complexity of national leaders’ thinking and the complexity level of the problems faced in their roles is important. While we will not be able to assess the actual complexity level of the thinking of national leaders, we will be able to examine the complexity of their responses to questions posed by prominent journalists. In upcoming articles, we’ll be sharing our findings and discussing their implications.
Arthur, W., Day, E. A., McNelly, T. A., & Edens, P. S. (2003). A meta‐analysis of the criterion‐related validity of assessment center dimensions. Personnel Psychology, 56(1), 125-153.
Becker, N., Höft, S., Holzenkamp, M., & Spinath, F. M. (2011). The Predictive Validity of Assessment Centers in German-Speaking Regions. Journal of Personnel Psychology, 10(2), 61-69.
Beehr, T. A., Ivanitskaya, L., Hansen, C. P., Erofeev, D., & Gudanowski, D. M. (2001). Evaluation of 360 degree feedback ratings: relationships with each other and with performance and selection predictors. Journal of Organizational Behavior, 22(7), 775-788.
Dawson, T. L., & Stein, Z. (2004). National Leadership Study results. Prepared for the U.S. Intelligence Community.
Dawson, T. L. (2017, October 20). Using technology to advance understanding: The calibration of CLAS, an electronic developmental scoring system. Proceedings from Annual Conference of the Northeastern Educational Research Association, Trumbull, CT.
Dawson, T. L., & Thornton, A. M. A. (2017, October 18). An examination of the relationship between argumentation quality and students’ growth trajectories. Proceedings from Annual Conference of the Northeastern Educational Research Association, Trumbull, CT.
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72(3), 493-511.
Helbing, D. (2013). Globally networked risks and how to respond. Nature, 497, 51-59.
Hunter, J. E., & Hunter, R. F. (1984). The validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.
Hunter, J. E., Schmidt, F. L., & Judiesch, M. K. (1990). Individual differences in output variability as a function of job complexity. Journal of Applied Psychology, 75, 28-42.
Johnson, J. (2001). Toward a better understanding of the relationship between personality and individual job performance. In M. R. R. Barrick, Murray R. (Ed.), Personality and work: Reconsidering the role of personality in organizations (pp. 83-120).
McDaniel, M. A., Schmidt, F. L., & Hunter, J., E. (1988a). A Meta-analysis of the validity of training and experience ratings in personnel selection. Personnel Psychology, 41(2), 283-309.
McDaniel, M. A., Schmidt, F. L., & Hunter, J., E. (1988b). Job experience correlates of job performance. Journal of Applied Psychology, 73, 327-330.
McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). Validity of employment interviews. Journal of Applied Psychology, 79, 599-616.
Rothstein, H. R., Schmidt, F. L., Erwin, F. W., Owens, W. A., & Sparks, C. P. (1990). Biographical data in employment selection: Can validities be made generalizable? Journal of Applied Psychology, 75, 175-184.
Stein, Z., Dawson, T., Van Rossum, Z., Hill, S., & Rothaizer, S. (2013, July). Virtuous cycles of learning: using formative, embedded, and diagnostic developmental assessments in a large-scale leadership program. Proceedings from ITC, Berkeley, CA.
Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A meta-analytic review. Personnel Psychology, 44, 703-742.
This morning, I received a newsletter from Sir Ken Robinson, a popular motivational speaker who focuses on education. There was a return email address, so I wrote to him. Here's what I wrote:
Dear Sir Ken,
"I love your message. I'm one of the worker bees who's trying to leverage the kind of changes you envision.
After 20+ years of hard work, my colleagues and I have reinvented educational assessment. No multiple choice. No high stakes. Our focus is on assessment for learning—supporting students in learning joyfully and deeply in a way that facilitates skills for learning, thinking, inquiring, relating and otherwise navigating a complex world. Our assessments are scalable and standardized, but they do not homogenize. They are grounded in a deep study of the many pathways through which students learn key skills and concepts. We're documenting, in exquisite (some would say insane) detail, how concepts and skills develop over time so we can gain insight into learners' knowledge networks. We don't ask about correctness. We ask about understanding and competence and how they develop over time. And we help teachers meet students "where they're at."
We've accumulated a strong base of evidence to support these claims. But now that we're ready to scale, we're running up against hostility toward all standardized assessment. It's difficult to get to the point where we can even have a conversation with our pedagogical allies. Ouch!
Lectica is organized as a nonprofit so we can guarantee that the underprivileged are served first. We plan to offer subscriptions to our assessments (learning tools) without charge to individual teachers everywhere.
We've kept our heads down as we've developed our methods and technology. Now we're scaling and want to be seen. We know we're part of the solution to today's educational crisis—perhaps a very big part of the solution. I'm hoping you'd like to learn more."
My email was returned with this message: "The email account that you tried to reach does not exist." How frustrating.
So, I thought I'd pen this post and ask my friends and colleagues to help me get access to Sir Ken's ear. If you know him, please forward this message. I'm certain he'll be interested in what we're doing for learning and development. Where are you Sir Ken Robinson? Can you hear me? Are you out there?