Statistics for all: significance vs. significance

There’s a battle out there no one’s tweeting about. It involves a tension between statistical significance and practical significance. If you make decisions that involve evaluating evidence—in other words, if you are human—understanding the distinction between these two types of significance will significantly improve your decisions (both practically and statistically).

Statistical significance

Statistical significance (a.k.a. “p”) is a calculation made to determine how confident we can be that a relationship between two factors (variables) is real. The lower a p value, the more confident we can be. Most of the time, we want p to be less than .05.

Don’t be misled! A low p value tells us nothing about the size of a relationship between two variables. When someone says that statistical significance is high, all this means is that we can be more confident that the relationship is real.

Replication

Once we know we can be confident that a relationship between two variables is real, we should check to see if the research has been replicated. That’s because we can’t be sure a statistically significant relationship found in a single study is really real. After we’ve determined that a relationship is statistically significant and replicable, it’s time to consider practical significance. Practical significance has to do with the size of the relationship.

Practical significance

To figure out how practically significant a relationship is, we need to know how big it is. The size of a relationship, or effect size, is evaluated independently of p. For a plain English discussion of effect size, check out this article, Statistics for all: prediction.

Importance

The greater the size of a relationship between two variables, the more likely the relationship is to be important — but that’s not enough. To have real importance, a relationship must also matter. And it is the decision-maker who decides what matters.

Examples

Let’s look at one of my favorite examples. The results of high-stakes tests like the SAT and GRE — college entrance exams made by ETS — have been shown to predict college success. Effect sizes tend to be small, but the effects are statistically significant — we can have confidence that they are real. And evidence for these effects has come from numerous studies, so we know they are really real.

If you’re the president of a college, there is little doubt that these test scores have practical significance. Improving prediction of student success, even a little, can have a big impact on the bottom line.

If you’re an employer, you’re more likely to care about how well a student did in college than how they did prior to college, so SAT and GRE scores are likely to be less important to you than college success.

If you’re a student, the size of the effect isn’t important at all. You don’t make the decision about whether or not the school is going to use the SAT or GRE to filter students. Whether or not these assessments are used is out of your control. What’s important to you is how a given college is likely to benefit you.

If you’re me, the size of the effect isn’t very important either. My perspective is that of someone who wants to see major changes in the educational system. I don’t think we’re doing our students any favors by focusing on the kind of learning that can be measured by tests like the GRE and SAT. I think our entire educational system leans toward the wrong goal—transmitting more and more “correct” information. I think we need to ask if what students are learning in school is preparing them for life.

Another thing to consider when evaluating practical significance is whether or not a relationship between two variables tells us only part of a more complex story. For example, the relationship between ethnicity and the rate of developmental growth (what my colleagues and I specialize in measuring) is highly statistically significant (real) and fairly strong (moderate effect size). But, this relationship completely disappears once socioeconomic status (wealth) is taken into account. The first relationship is misleading (spurious). The real culprit is poverty. It’s a social problem, not an ethnic problem.

Summing up

Most discussions of practical significance stop with effect size. From a statistical perspective, this makes sense. Statistics can’t be used to determine which outcomes matter. People have to do that part, but statistics, when good ones are available, should come first. Here’s my recipe:

  1. Find out if the relationship is real (p < .05).
  2. Find out if it is really real (replication).
  3. Consider the effect size.
  4. Decide how much it matters.
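If you like to see the arithmetic behind steps 1 and 3, here’s a minimal Python sketch using scipy and made-up data; the variable names and numbers are purely illustrative, and steps 2 and 4 can’t be computed for you.

```python
# A minimal sketch of steps 1 and 3, using fabricated example data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: an entrance-exam score and a college-success measure
# for 200 students (illustrative only).
exam_scores = rng.normal(100, 15, size=200)
college_success = 0.3 * exam_scores + rng.normal(0, 15, size=200)

# Step 1: Is the relationship real? (statistical significance)
r, p = stats.pearsonr(exam_scores, college_success)
print(f"p = {p:.4f} -> significant at the .05 level: {p < 0.05}")

# Step 3: How big is the relationship? (effect size)
print(f"r = {r:.2f}, r-square = {r**2:.2f} (share of variance explained)")

# Step 2 (replication) needs more studies, and step 4 (does it matter?)
# is a judgment call no statistic can make for you.
```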

My organization, Lectica, Inc., is a 501(c)3 nonprofit corporation. Part of our mission is to share what we learn with the world. One of the things we’ve learned is that many assessment buyers don’t seem to know enough about statistics to make the best choices. The Statistics for all series is designed to provide assessment buyers with the knowledge they need most to become better assessment shoppers.

 


Statistics for all: Prediction

Why you might want to reconsider using 360s and EQ assessments to predict recruitment success


Measurements are often used to make predictions. For example, they can help predict how tall a 4-year-old is likely to be in adulthood, which students are likely to do better in an academic program, or which candidates are most likely to succeed in a particular job.

Some of the attributes we measure are strong predictors; others are weaker. For example, a child’s height at age 4 is a pretty strong predictor of adult height. Parental height is a weaker predictor. The complexity of a person’s workplace decision making, on its own, is a moderate predictor of success in the workplace. But the relation between the complexity of their workplace decision making and the complexity of their role is a strong predictor.

How do we determine the strength of a predictor? In statistics, the strength of predictions is represented by an effect size. Most effect size indicators are expressed as decimals and range from .00 to 1.00, with 1.00 representing 100% accuracy of prediction. The effect size indicator you’ll see most often is r-square. If you’ve ever been forced to take a statistics course—;)—you may remember that r represents the strength of a correlation. Before I explain r-square, let’s look at some correlation data.

The four figures below represent four different correlations, from weakest (.30) to strongest (.90). Let’s say the vertical axis (40–140) represents the level of success in college, and the horizontal axis (50–150) represents scores on one of four college entrance exams. The dots represent students. If you were trying to predict success in college, you would be wise to choose the college entrance exam that delivered an r of .90.

Why is an r of .90 preferable? Well, take a look at the next set of figures. I’ve drawn lines through the clouds of dots (students) to show regression lines. These lines represent the prediction we would make about how successful a student will be, given a particular score. It’s clear that in the case of the first figure (r = .30), this prediction is likely to be pretty inaccurate. Many students perform better or worse than predicted by the regression line. But as the correlations increase in size, prediction improves. In the case of the fourth figure (r = .90), the prediction is most accurate.

What does a .90 correlation mean in practical terms? That’s where r-square comes in. If we multiply .90 by .90 (calculate the square), we get an r-square of .81. Statisticians would say that the predictor (test score) explains 81% of the variance in college success. The 19% of the variance that’s not explained (1.00 - .81 = .19) represents the percent of the variance that is due to error (unexplained variance). The square root of .19 is the amount of error (about .44).
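For readers who prefer to see it computed, here’s the same arithmetic in a few lines of Python; the only input is the r value from the example above.

```python
# The r-square arithmetic from the example above, spelled out.
r = 0.90

r_square = r ** 2               # 0.81 -> the predictor explains 81% of the variance
unexplained = 1.0 - r_square    # 0.19 -> 19% of the variance is unexplained (error)
error = unexplained ** 0.5      # ~0.44 -> the square root of the unexplained variance

print(f"r = {r}, r-square = {r_square:.2f}, "
      f"unexplained = {unexplained:.2f}, error = {error:.2f}")
```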

Even when r = .90, error accounts for 19% of the variance.

Correlations of .90 are very rare in the social sciences—but even correlations this strong are associated with a significant amount of error. It’s important to keep error in mind when we use tests to make big decisions—like who gets hired or who gets to go to college. When we use tests to make decisions like these, the business or school is likely to benefit—slightly better prediction can result in much better returns. But there are always rejected individuals who would have performed well, and there are always accepted individuals who will perform badly.

For references, see: The complexity of national leaders’ thinking: How does it measure up?

Let’s get realistic. As I mentioned earlier, correlations of .90 are very rare. In recruitment contexts, the most predictive assessments (shown above) correlate with hire success in the range of .50–.54, predicting 25%–29% of the variance in hire success. That leaves a whopping 71%–75% of the variance unexplained, which is why the best hiring processes not only use the most predictive assessments, but also consider multiple predictive criteria.

On the other end of the spectrum, there are several common forms of assessment that explain less than 9% of the variance in recruitment success. Their correlations with recruitment success are lower than .30. Yet some of these, like 360s, reference checks, and EQ, are wildly popular. In the context of hiring, the share of variance left to error in these cases (more than 91%) means there is a very big risk of being unfair to a large percentage of candidates. (I’m pretty certain assessment buyers aren’t intentionally being unfair. They probably just don’t know about effect size.)

If you’ve read my earlier article about replication, you know that the power-posing research could not be replicated. You also might be interested to learn that the correlations reported in the original research were also lower than .30. If power-posing had turned out to be a proven predictor of presentation quality, the question I’d be asking myself is, “How much effort am I willing to put into power-posing when the variance explained is lower than 9%?”

If we were talking about something other than power-posing, like reducing even a small risk that my child would die of a contagious disease, I probably wouldn’t hesitate to make a big effort. But I’m not so sure about power-posing before a presentation. Practicing my presentation or getting feedback might be a better use of my time.

Summing up (for now)

A basic understanding of prediction is worth cultivating. And it’s pretty simple. You don’t even have to do any fancy calculations. Most importantly, it can save you time and tons of wasted effort by giving you a quick way to estimate the likelihood that an activity is worth doing (or product is worth having). Heck, it can even increase fairness. What’s not to like?



Statistics for all: Replication

(Why you should have been suspicious of power-posing from the start!)

I’ve got a free, low-tech life hack for you that will save significant time and money — and maybe even improve your health. All you need to do is one little thing. Before you let the latest research results change your behavior, check to see if the research has been replicated!

One of the hallmarks of modern science is the notion that one study of a new phenomenon—especially a single small study—proves nothing. Most of the time, the results of such studies can do little more than suggest possibilities. To arrive at proof, results have to be replicated—again and again, usually in a variety of contexts. This is important, especially in the social sciences, where phenomena are difficult to measure and the results of many new studies cannot be replicated.

Researchers used to be trained to avoid even implying that findings from a new study were proven facts. But when Amy Cuddy set out to share the results of her and her colleagues’ power-posing research, she didn’t simply imply that her results could be generalized. She unabashedly announced to an enthralled TED Talk audience that she’d discovered a “Free, no-tech life hack…that could significantly change how your life unfolds.”

Thanks to this talk, many thousands—perhaps millions—of people-hours have been spent power-posing. But it’s not the power-posers whose lives have changed. Unfortunately, as it turns out, it’s Dr. Cuddy’s life that changed significantly—when other researchers were unable to replicate her results. In fact, because she had made such strong unwarranted claims, Dr. Cuddy became the focus of severe criticism.

Although she was singled out, Dr. Cuddy is far from alone. She’s got lots of company. Many fads have begun just like power-posing did. Here’s how it goes: A single small study produces results that have “novelty appeal,” the Today Show picks up the story, and thousands jump on the bandwagon! Sometimes, as in the case of power-posing, the negative impact is no worse than a bit of wasted time. But in other cases, such as when our health or pocketbooks are at stake, the impacts can be much greater.

“But it worked for me!” If you tried power-posing and believe it was responsible for your success in achieving an important goal, you may be right. The scientific method isn’t perfect — especially in the social sciences — and future studies with better designs may support your belief. However, I recommend caution in relying on personal experience. Humans have powerful built-in mental biases that lead us to conclude that positive outcomes are caused by something we did to induce them. This makes it very difficult for us to distinguish between coincidence and cause. And it’s one reason we need the scientific method, which is designed to help us reduce the impact of these biases.

Replication matters in assessment development, too

Over the last couple of decades, I’ve looked at the reliability and validity evidence for many assessments. The best assessment developers set a pretty high replication standard, conducting several validity and reliability studies for each assessment they offer. But many assessment providers—especially those serving businesses—are much more lax. In fact, many can point to only a single study of reliability and validity. To make matters worse, in some cases, that study has not been peer reviewed.

Be wary of assessments that aren’t backed by several studies of reliability and validity.



From Piaget to Dawson: The evolution of adult developmental metrics

I've just added a new video about the evolution of adult developmental metrics to YouTube and LecticaLive. It traces the evolutionary history of Lectica's developmental model and metric.

If you are curious about the origins of our work, this video is a great place to start. If you'd like to see the reference list for this video, view it on LecticaLive.

 

 


Adaptive learning. Are we there yet?

Adaptive learning technologies are touted as an advance in education and a harbinger of what's to come. But although we at Lectica agree that adaptive learning has a great deal to offer, we have some concerns about its current limitations. In an earlier article, I raised the question of how well one of these platforms, Knewton, serves "robust learning"—the kind of learning that leads to deep understanding and usable knowledge. Here are some more general observations.

The great strength of adaptive learning technologies is that they allow students to learn at their own pace. That's big. It's quite enough to be excited about, even if it changes nothing else about how people learn. But in our excitement about this advance, the educational community is in danger of ignoring important shortcomings of these technologies.

First, adaptive learning technologies are built on adaptive testing technologies. Today, these testing technologies are focused on "correctness." Students are moved to the next level of difficulty based on their ability to get correct answers. This is what today's testing technologies measure best. However, although being able to produce or select correct answers is important, it is not an adequate indication of understanding. And without real understanding, knowledge is not usable and can't be built upon effectively over the long term.

Second, today's adaptive learning technologies are focused on a narrow range of content—the kind of content psychometricians know how to build tests for—mostly math and science (with an awkward nod to literacy). In public education during the last 20 years, we've experienced a gradual narrowing of the curriculum, largely because of high stakes testing and its narrow focus. Today's adaptive learning technologies suffer from the same limitations and are likely to reinforce this trend.

Third, the success of adaptive learning technologies is measured with standardized tests of correctness. Higher scores will help more students get into college—after all, colleges use these tests to decide who will be admitted. But we have no idea how well higher scores on these tests translate into life success. Efforts to demonstrate the relevance of educational practices are few and far between. And notably, there are many examples of highly successful individuals who were poor players in the education game—including several of the world's most productive and influential people.

Fourth, some proponents of online adaptive learning believe that it can and should replace (or marginalize) teachers and classrooms. This is concerning. Education is more than a process of accumulating facts. For one thing, it plays an enormous role in socialization. Good teachers and classrooms offer students opportunities to build knowledge while learning how to engage and work with diverse others. Great teachers catalyze optimal learning and engagement by leveraging students' interests, knowledge, skills, and dispositions. They also encourage students to put what they're learning to work in everyday life—both on their own and in collaboration with others.

Lectica has a strong interest in adaptive learning and the technologies that deliver it. We anticipate that over the next few years, our assessment technology will be integrated into adaptive learning platforms to help expand their subject matter and ensure that students are building robust, usable knowledge. We will also be working hard to ensure that these platforms are part of a well-thought-out, evidence-based approach to education—one that fosters the development of tomorrow's skills—the full range of skills and knowledge required for success in a complex and rapidly changing world.


Introducing Lectica First: Front-line to mid-level recruitment assessment—on demand

The world’s best recruitment assessments—unlimited, auto-scored, affordable, relevant, and easy

Lectical Assessments have been used to support senior and executive recruitment for over 10 years, but the expense of human scoring has prohibited their use at scale. I’m delighted to report that this is no longer the case. Because of CLAS—our electronic developmental scoring system—we plan to deliver customized assessments of workplace reasoning with real-time scoring. We’re calling this service Lectica First.

Lectica First is a subscription service.* It allows you to administer as many Lectica First assessments as you’d like, any time you’d like. It’s priced to make it possible for your organization to pre-screen every candidate (up through mid-level management) before you look at a single resume or call a single reference. And we’ve built in several upgrade options, so you can easily obtain additional information about the candidates that capture your interest.

learn more about Lectica First subscriptions


The current state of recruitment assessment

“Use of hiring methods with increased predictive validity leads to substantial increases in employee performance as measured in percentage increases in output, increased monetary value of output, and increased learning of job-related skills” (Hunter, Schmidt, & Judiesch, 1990).

Most conventional workplace assessments measure either ability (knowledge & skill) or perspective (opinion or perception). These assessments examine factors like literacy, numeracy, role-specific competencies, leadership traits, personality, and cultural fit, and are generally delivered through interviews, multiple choice tests, or Likert-style surveys.

Lectical Assessments are tests of mental ability (or mental skill). High-quality tests of mental ability have the highest predictive validity for recruitment purposes, hands down. The latest meta-analytic study of predictive validity shows that tests of mental ability are by far the best predictors of recruitment success.

Personality tests come in a distant second. In their meta-analysis of the literature, Tett, Jackson, and Rothstein (1991) reported an overall relation between personality and job performance of .24 (with conscientiousness as the best predictor by a wide margin). Translated, this means that only about 6% of job performance is predicted by personality traits. These numbers do not appear to have been challenged in more recent research (Johnson, 2001).

Predictive validity of various types of assessments used in recruitment

The following figure shows average predictive validities for various forms of assessment used in recruitment contexts. The percentages indicate how much of a role a particular form of assessment plays in predicting performance—its predictive power. When deciding which assessments to use in recruitment, the goal is to achieve the greatest possible predictive power with the fewest assessments.

In the figure below, assessments are color-coded to indicate which are focused on mental (cognitive) skills, behavior (past or present), or personality traits. It is clear that tests of mental skills stand out as the best predictors.

Schmidt, F. L., Oh, I.-S., & Shaffer, J. A. (2016). Working paper: The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 100 years of research findings.
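To make “predictive power” concrete, here’s a rough sketch that converts a few validity coefficients into variance explained. The coefficients are ones cited in the text of this post (mental ability tests at roughly .50–.54, personality at .24, and 360s/EQ/reference checks below .30), not values read from the figure itself, so treat the output as illustrative.

```python
# Illustrative only: coefficients come from the text of this post, not from
# the figure, and the exact figure values may differ.
predictors = {
    "tests of mental ability": 0.52,       # midpoint of the .50-.54 range cited above
    "360s / EQ / reference checks": 0.29,  # "lower than .30"
    "personality (overall)": 0.24,         # Tett, Jackson, & Rothstein (1991)
}

# Predictive power here = share of variance in performance explained (r-square).
for name, r in sorted(predictors.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<30} r = {r:.2f}   variance explained = {r * r:.0%}")
```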

Why use Lectical Assessments for recruitment?

Lectical Assessments are “next generation” assessments of mental ability, made possible through a novel synthesis of developmental theory, primary research, and technology. Until now, multiple-choice-style ability tests have been the most affordable option for employers. But despite being far more predictive than other types of tests, these tests suffer from important limitations. Lectical Assessments address these limitations. For details, take a look at the side-by-side comparison of Lectica First tests with conventional tests, below.

Dimension-by-dimension comparison: Lectica First vs. conventional aptitude tests

Accuracy
• Lectica First: Level of reliability (.95–.97) makes them accurate enough for high-stakes decision-making. (Interpreting reliability statistics)
• Aptitude tests: Varies greatly. The best aptitude tests have levels of reliability in the .95 range. Many recruitment tests have much lower levels.

Time investment
• Lectica First: Lectical Assessments are not timed. They usually take from 45–60 minutes, depending on the individual test-taker.
• Aptitude tests: Varies greatly. For acceptable accuracy, tests must have many items and may take hours to administer.

Objectivity
• Lectica First: Scores are objective (computer scoring is blind to differences in sex, body weight, ethnicity, etc.).
• Aptitude tests: Scores on multiple choice tests are objective. Scores on interview-based tests are subject to several sources of bias.

Expense
• Lectica First: Highly affordable.
• Aptitude tests: Expensive.

Fit to role: complexity
• Lectica First: Lectica employs sophisticated developmental tools and technologies to efficiently determine the relation between the complexity of role requirements and the level of mental skill required to meet those requirements.
• Aptitude tests: Lectica’s approach is not directly comparable to other available approaches.

Fit to role: relevance
• Lectica First: Lectical Assessments are readily customized to fit particular jobs, and are direct measures of what’s most important—whether or not candidates’ actual workplace reasoning skills are a good fit for a particular job.
• Aptitude tests: Aptitude tests measure people’s ability to select correct answers to abstract problems. It is hoped that these answers will predict how good a candidate’s workplace reasoning skills are likely to be.

Predictive validity
• Lectica First: In research so far: predicts advancement (uncorrected R = .53**, R² = .28), National Leadership Study.
• Aptitude tests: The aptitude (IQ) tests used in published research predict performance (uncorrected R = .45 to .54, R² = .20 to .29).

Cheating
• Lectica First: The written response format makes cheating virtually impossible when assessments are taken under observation, and very difficult when taken without observation.
• Aptitude tests: Cheating is relatively easy and rates can be quite high.

Formative value
• Lectica First: High. Lectica First assessments can be upgraded after hiring, then used to inform employee development plans.
• Aptitude tests: None. Aptitude is a fixed attribute, so there is no room for growth.

Continuous improvement
• Lectica First: Our assessments are developed with a 21st century learning technology that allows us to continuously improve the predictive validity of Lectica First assessments.
• Aptitude tests: Conventional aptitude tests are built with a 20th century technology that does not easily lend itself to continuous improvement.

* CLAS is not yet fully calibrated for scores above 11.5 on our scale. Scores at this level are more often seen in upper- and senior-level managers and executives. For this reason, we do not recommend using Lectica First for recruitment above mid-level management.

**The US Department of Labor’s highest category of validity, labeled “Very Beneficial,” requires regression coefficients of .35 or higher (R > .34).
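As a quick sanity check on the footnote above, here’s a small sketch that compares the uncorrected coefficients from the table against the Department of Labor threshold cited there (R > .34 for the “Very Beneficial” category). Only that one threshold is checked, since the other categories aren’t listed here.

```python
# Checking the uncorrected validity coefficients from the table above against
# the US Department of Labor threshold cited in the footnote (R >= .35 for the
# "Very Beneficial" category). Other categories aren't listed in this post.
VERY_BENEFICIAL_MIN_R = 0.35

coefficients = {
    "Lectica First (National Leadership Study)": 0.53,
    "aptitude (IQ) tests, low end of published range": 0.45,
    "aptitude (IQ) tests, high end of published range": 0.54,
}

for name, r in coefficients.items():
    meets = r >= VERY_BENEFICIAL_MIN_R
    print(f"{name}: R = {r:.2f}, R^2 = {r * r:.2f}, 'Very Beneficial': {meets}")
```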

References

Arthur, W., Day, E. A., McNelly, T. A., & Edens, P. S. (2003). A meta‐analysis of the criterion‐related validity of assessment center dimensions. Personnel Psychology, 56(1), 125-153.

Becker, N., Höft, S., Holzenkamp, M., & Spinath, F. M. (2011). The predictive validity of assessment centers in German-speaking regions. Journal of Personnel Psychology, 10(2), 61-69.

Beehr, T. A., Ivanitskaya, L., Hansen, C. P., Erofeev, D., & Gudanowski, D. M. (2001). Evaluation of 360 degree feedback ratings: relationships with each other and with performance and selection predictors. Journal of Organizational Behavior, 22(7), 775-788.

Dawson, T. L., & Stein, Z. (2004). National Leadership Study results. Prepared for the U.S. Intelligence Community.

Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72(3), 493-511.

Hunter, J. E., & Hunter, R. F. (1984). The validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.

Hunter, J. E., Schmidt, F. L., & Judiesch, M. K. (1990). Individual differences in output variability as a function of job complexity. Journal of Applied Psychology, 75, 28-42.

Johnson, J. (2001). Toward a better understanding of the relationship between personality and individual job performance. In M. R. R. Barrick, Murray R. (Ed.), Personality and work: Reconsidering the role of personality in organizations (pp. 83-120).

McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988a). A meta-analysis of the validity of training and experience ratings in personnel selection. Personnel Psychology, 41(2), 283-309.

McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988b). Job experience correlates of job performance. Journal of Applied Psychology, 73, 327-330.

McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). Validity of employment interviews. Journal of Applied Psychology, 79, 599-616.

Rothstein, H. R., Schmidt, F. L., Erwin, F. W., Owens, W. A., & Sparks, C. P. (1990). Biographical data in employment selection: Can validities be made generalizable? Journal of Applied Psychology, 75, 175-184.

Schmidt, F. L., Oh, I.-S., & Shaffer, J. A. (2016). Working paper: The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 100 years of research findings.

Stein, Z., Dawson, T., Van Rossum, Z., Hill, S., & Rothaizer, S. (2013, July). Virtuous cycles of learning: using formative, embedded, and diagnostic developmental assessments in a large-scale leadership program. Proceedings from ITC, Berkeley, CA.

Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A meta-analytic review. Personnel Psychology, 44, 703-742.

Zeidner, M., Matthews, G., & Roberts, R. D. (2004). Emotional intelligence in the workplace: A critical review. Applied Psychology: An International Review, 53(3), 371-399.


Support from neuroscience for robust, embodied learning

[Image: Fluid intelligence connectome. “Human connector,” by jgmarcelino from Newcastle upon Tyne, UK, via Wikimedia Commons]

For many years, we’ve been arguing that learning is best viewed as a process of creating networks of connections. We’ve defined robust learning as a process of building knowledge networks that are so well connected they allow us to put knowledge to work in a wide range of contexts. And we’ve described embodied learning as a way of learning that involves the whole person and is much more than the memorization of facts, terms, definitions, rules, or procedures.

New evidence from the neurosciences provides support for this way of thinking about learning. According to research recently published in Nature, people with more connected brains—specifically, those with more connections across different parts of the brain—demonstrate greater intelligence, including better problem-solving skills, than people with less connected brains. And this is only one of several research projects that report similar findings.

Lectica exists because we believe that if we really want to support robust, embodied learning, we need to measure it. Our assessments are the only standardized assessments that have been deliberately developed to measure and support this kind of learning.


The assessment triangle: correctness, coherence, & complexity

How to use the assessment triangle diagnostically

An ideal educational assessment strategy—represented above in the assessment triangle—includes three indicators of learning: correctness (content knowledge), complexity (developmental level of understanding), and coherence (quality of argumentation). Lectical Assessments focus primarily on two areas of the triangle—complexity and coherence. Complexity is measured with the Lectical Assessment System, and coherence is measured with a set of argumentation rubrics focused on mechanics, logic, and persuasiveness. We do not focus on correctness, primarily because most assessments already target correctness.

At the center of the assessment triangle is a hazy area. This represents the Goldilocks Zone—the range in which the difficulty of learning tasks is just right for a particular student. To diagnose the Goldilocks Zone, educators evaluate correctness, coherence, and complexity, plus a given learner’s level of interest and tolerance for failure.

When educators work with Lectical Assessments, they use the assessment triangle to diagnose students’ learning needs. Here are some examples:

Level of skill (low, average, high) relative to expectations
Case | Complexity | Coherence | Correctness
Case 1 | high | low | high
Case 2 | high | high | low
Case 3 | low | low | high
Case 4 | high | high | high

Case 1

This student has relatively high complexity and correctness scores, but his performance is low in coherence. Because lower coherence scores suggest that he has not yet fully integrated his existing knowledge, he is likely to benefit most from participating in interesting activities that require applying existing knowledge in relevant contexts (using VCoL).

Case 2

This student’s scores are high relative to expectations. Her knowledge appears to be well integrated, but the low correctness suggests that there are gaps in her content knowledge relative to targeted content. Here, we would suggest filling in the missing content knowledge in a way that engages the learner and allows her to integrate it into her well-developed knowledge network.

Case 3

This student’s scores are high for correctness but low for complexity and coherence. This pattern suggests that the student is memorizing content without integrating it effectively into their knowledge network—and may have been doing this for some time. This student is most likely to benefit from applying their existing content knowledge in personally relevant contexts (using VCoL) until their coherence and complexity scores catch up with their correctness scores.

Case 4

The scores received by this student are high for correctness, complexity, and coherence. This pattern suggests that the student has a high level of proficiency. Here, we would suggest introducing new knowledge that’s just challenging enough to keep her in her personal Goldilocks zone.
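For readers who think in code, here’s a hedged sketch of the diagnostic logic behind the four cases. The function name, the low/high labels, and the wording of the recommendations are illustrative paraphrases of the text above, not part of any Lectica tool.

```python
# A sketch of the assessment-triangle heuristic described in Cases 1-4.
# Inputs are "low" or "high" relative to expectations; outputs paraphrase the text.
def triangle_recommendation(complexity: str, coherence: str, correctness: str) -> str:
    profile = (complexity, coherence, correctness)
    if profile == ("high", "low", "high"):      # Case 1
        return ("Apply existing knowledge in interesting, relevant contexts "
                "(using VCoL) so it becomes better integrated.")
    if profile == ("high", "high", "low"):      # Case 2
        return ("Fill gaps in targeted content knowledge in an engaging way, "
                "so it can join an already well-developed knowledge network.")
    if profile == ("low", "low", "high"):       # Case 3
        return ("Apply existing content knowledge in personally relevant contexts "
                "(using VCoL) until complexity and coherence catch up.")
    if profile == ("high", "high", "high"):     # Case 4
        return ("Introduce new knowledge that is just challenging enough to keep "
                "the learner in the Goldilocks Zone.")
    return "Pattern not covered by the four example cases; use educator judgment."

# Example: the Case 3 pattern (memorizing without integrating).
print(triangle_recommendation("low", "low", "high"))
```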

Summing up

The assessment triangle helps educators optimize learning by ensuring that students are always learning in the Goldilocks Zone. This is a good thing, because students who spend more time in the Goldilocks Zone not only enjoy learning more, they learn better and faster.


What PISA measures. What we measure.

Like the items in Lectical Assessments, PISA items involve real-world problems. PISA developers also claim, as we do here at Lectica, that their items measure how knowledge is applied. So, why do we persist in claiming that Lectical Assessments and assessments like PISA measure different things?

Part of the answer lies in questions about what’s actually being measured, and in the meaning of terms like “real world problems” and “how knowledge is applied.” I’ll illustrate with an example from Take the test: Sample questions from OECD’s PISA assessments.

One of the reading comprehension items in “Take the test” involves a short story about a woman who is trapped in her home during a flood. Early in the story, a hungry panther arrives on her porch. The woman has a gun, which she keeps at her side as long as the panther is present. At first, it seems that she will kill the panther, but in the end, she offers it a ham hock instead.

What is being measured?

There are three sources of difficulty in the story. Its Lectical phase is 10c — the third phase of four in level 10. Also, the story is challenging to interpret because it’s written to be a bit ambiguous. I had to read it twice in order to appreciate the subtlety of the author’s message. And it is set on the water in a rural setting, so there’s lots of language that would be new to many students. How well a student will comprehend this story hinges on their level of understanding — where they are currently performing on the Lectical Scale — and how much they know about living on the water in a rural setting. Assuming they understand the content of the story, comprehension also depends on how good students are at decoding its somewhat ambiguous message.

The first question that comes up for me is whether or not this is a good story selection for the average 15-year-old. The average phase of performance for most 15-year-olds is 10a. That’s their productive level. When we prescribe learning recommendations to students performing in 10a, we choose texts that are about 1 phase higher than their current productive level. We refer to this as the “Goldilocks zone”, because we’ve found it to be the range in which material is just difficult enough to be challenging, but not so difficult that the risk of failure is too high. Some failure is good. Constant failure is bad.

But this PISA story is intended to test comprehension; it’s not a learning recommendation or resource. Here, its difficulty level raises a different issue. In this context, the question that arises for me is, “What is reading comprehension, when the text students are asked to decode presents different challenges to students living in different environments and performing in different Lectical Levels?” Clearly, this story does not present the same challenge to students performing in phase 10a as it presents to students performing in 10c. Students performing in 10a or lower are struggling to understand the basic content of the story. Students performing in 10c are grappling with the subtlety of the message. And if the student lives in a city and knows nothing about living on the water, even a student performing at 10c is disadvantaged.

Real world problems

Now, let’s consider what it means to present a real-world problem. When we at Lectica use this term, we usually mean that the problem is ill-structured (like the world), without a “correct” answer. (We don’t even talk about correctness.) The challenges we present to learners reveal the current level of their understandings—there is always room for growth. One of our interns refers to development as a process of learning to make “better and better mistakes”. This is a VERY different mindset from the “right or wrong” mindset nurtured by conventional standardized tests.

What do PISA developers mean by “real world problem”? They clearly don’t mean without a “correct” answer. Their scoring rubrics show correct, partial (sometimes), and incorrect answers. And it doesn’t get any more subtle than that. I think what they mean by “real world” is that their problems are contextualized; they are simply set in the real world. But this is not a fundamental change in the way PISA developers think about learning. Theirs is still a model that is primarily about the ability to get right answers.

How knowledge is applied

Let’s go back to the story about the woman and the panther. After they read the story, test-takers are asked to respond to a series of multiple choice and written response questions. In one written response question they are asked, “What does the story suggest was the woman’s reason for feeding the panther?”

The scoring rubric presents a selection of potential correct answers and a set of wrong answers. (No partially correct answers here.) It’s pretty clear that when PISA developers ask “how well” students’ knowledge is applied, they’re talking about whether or not students can provide a correct answer. That’s not surprising, given what we’ve observed so far. What’s new and troubling here is that all “correct” answers are treated as though they are equivalent. Take a look at the list of choices. Do they look equally sophisticated to you?

  • She felt sorry for it.
  • Because she knew what it felt like to be hungry.
  • Because she’s a compassionate person.
  • To help it live. (p. 77)

“She felt sorry for it.” is considered to be just as correct as “She is a compassionate person.” But we know the ideas expressed in these two statements are not equivalent. The idea of feeling sorry for can be expressed by children as early as phase 08b (6- to 7-year-olds). The idea of compassion (as sympathy) does not appear until level 10b. And the idea of being a compassionate person does not appear until 10c—even when the concept of compassion is being explicitly taught. Given that this is a test of comprehension—defined by PISA’s developers in terms of understanding and interpretation—doesn’t the student who writes, “She is a compassionate person,” deserve credit for arriving at a more sophisticated interpretation?

I’m not claiming that students can’t learn the word compassion earlier than level 10b. And I’m certainly not claiming that there is enough evidence in students’ responses to the prompt in this assessment to determine if an individual who wrote “She felt sorry for it.” meant something different from an individual who wrote, “She’s a compassionate person.” What I am arguing is that what students mean is more important than whether or not they get a right answer. A student who has constructed the notion of compassion as sympathy is expressing a more sophisticated understanding of the story than a student who can’t go further than saying the protagonist felt sorry for the panther. When we, at Lectica, talk about how well knowledge is applied, we mean, “At what level does this child appear to understand the concepts she’s working with and how they relate to one another?”

What is reading comprehension?

All of these observations lead me back to the question, “What is reading comprehension?” PISA developers define reading comprehension in terms of understanding and interpretation, and Lectical Assessments measure the sophistication of students’ understanding and interpretation. It looks like our definitions are at least very similar.

We think the problem is not in the definition, but in the operationalization. PISA’s items measure proxies for comprehension, not comprehension itself. Getting beyond proxies requires three ingredients.

  • First, we have to ask students to show us how they’re thinking. This means asking for verbal responses that include both judgments and justifications for those judgments.
  • Second, the questions we ask need to be more open-ended. Life is rarely about finding right answers. It’s about finding increasingly adequate answers. We need to prepare students for that reality.
  • Third, we need to engage in the careful, painstaking study of how students construct meanings over time.

This third requirement is such an ambitious undertaking that many scholars don’t believe it’s possible. But we’ve not only demonstrated that it’s possible, we’re doing it every day. We call the product of this work the Lectical™ Dictionary. It’s the first curated developmental taxonomy of meanings. You can think of it as a developmental dictionary. Aside from making it possible to create direct tests of student understanding, the Lectical Dictionary makes it easy to describe how ideas evolve over time. We can not only tell people what their scores mean, but also what they’re most likely to benefit from learning next. If you’re wondering what that means in practice, check out our demo.


Straw men and flawed metrics

Ten years ago, Kirschner, Sweller, & Clark published an article entitled Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching.

In this article, Kirschner and his colleagues contrast outcomes for what they call "guidance instruction" (lecture and demonstration) with those from constructivism-based instruction. They conclude that constructivist approaches produce inferior outcomes.

The article suffers from at least three serious flaws

First, the authors, in making their distinction between guided instruction and constructivist approaches, have created a caricature of constructivist approaches. Very few experienced practitioners of constructivist, discovery, problem-based, experiential, or inquiry-based teaching would characterize their approach as minimally guided. "Differently guided" would be a more appropriate term. Moreover, most educators who use constructivist approaches include lecture and demonstration where these are appropriate.

Second, the research reviewed by the authors was fundamentally flawed. For the most part, the metrics employed to evaluate different styles of instruction were not reasonable measures of the kind of learning constructivist instruction aims to support—deep understanding (the ability to apply knowledge effectively in real-world contexts). They were measures of memory or attitude. Back in 2010, Stein, Fisher, and I argued that metrics can't produce valid results if they don't actually measure what we care about (Redesigning testing: Operationalizing the new science of learning. Why isn't this a no-brainer?).

And finally, the longitudinal studies Kirschner and his colleagues reviewed had short time-spans. None of them examined the long-term impacts of different forms of instruction on deep understanding or long-term development. This is a big problem for learning research—one that is often acknowledged, but rarely addressed.

Since Kirschner's article was published in 2006, we've had an opportunity to examine the difference between schools that provide different kinds of instruction, using assessments that measure the depth and coherence of students' understanding. We've documented a 3 to 5 year advantage, by grade 12, for students who attend schools that emphasize constructivist methods vs. those that use more "guidance instruction".

To learn more, see:

Are our children learning robustly?

Lectica rationale

 
