Statistics for all: significance vs. significance

There’s a battle out there no one’s tweeting about. It involves a tension between statistical significance and practical significance. If you make decisions that involve evaluating evidence—in other words, if you are human—understanding the distinction between these two types of significance will significantly improve your decisions (both practically and statistically).

Statistical significance

Statistical significance (usually reported as a p value) is a calculation made to determine how confident we can be that a relationship between two factors (variables) is real. The lower a p value, the more confident we can be. Most of the time, we want p to be less than .05.

Don’t be misled! A low p value tells us nothing about the size of a relationship between two variables. When someone says that statistical significance is high, all this means is that we can be more confident that the relationship is real.

Replication

Once we know we can be confident that a relationship between two variables is real, we should check to see if the research has been replicated. That’s because we can’t be sure a statistically significant relationship found in a single study is really real. After we’ve determined that a relationship is statistically significant and replicable, it’s time to consider practical significance. Practical significance has to do with the size of the relationship.

Practical significance

To figure out how practically significant a relationship is, we need to know how big it is. The size of a relationship, or effect size, is evaluated independently of p. For a plain English discussion of effect size, check out this article, Statistics for all: prediction.

Importance

The greater the size of a relationship between two variables, the more likely the relationship is to be important — but that’s not enough. To have real importance, a relationship must also matter. And it is the decision-maker who decides what matters.

Examples

Let’s look at one of my favorite examples. The results of high stakes tests like the SAT and GRE (college entrance exams made by ETS) have been shown to predict college success. Effect sizes tend to be small, but the effects are statistically significant, so we can have confidence that they are real. And evidence for these effects has come from numerous studies, so we know they are really real.

If you’re the president of a college, there is little doubt that these test scores have practical significance. Improving prediction of student success, even a little, can have a big impact on the bottom line.

If you’re an employer, you’re more likely to care about how well a student did in college than how they did prior to college, so SAT and GRE scores are likely to be less important to you than college success.

If you’re a student, the size of the effect isn’t important at all. You don’t make the decision about whether or not the school is going to use the SAT or GRE to filter students. Whether or not these assessments are used is out of your control. What’s important to you is how a given college is likely to benefit you.

If you’re me, the size of the effect isn’t very important either. My perspective is that of someone who wants to see major changes in the educational system. I don’t think we’re doing our students any favors by focusing on the kind of learning that can be measured by tests like the GRE and SAT. I think our entire educational system leans toward the wrong goal—transmitting more and more “correct” information. I think we need to ask if what students are learning in school is preparing them for life.

Another thing to consider when evaluating practical significance is whether or not a relationship between two variables tells us only part of a more complex story. For example, the relationship between ethnicity and the rate of developmental growth (what my colleagues and I specialize in measuring) is highly statistically significant (real) and fairly strong (moderate effect size). But this relationship completely disappears once socioeconomic status (wealth) is taken into account. The first relationship is misleading (spurious). The real culprit is poverty. It’s a social problem, not an ethnic problem.
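One way to see how a relationship can vanish once a third variable is accounted for is a first-order partial correlation. The sketch below is illustrative only. The three correlations are made-up numbers, not results from the research described above, and the function name is hypothetical.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y after removing what each shares with z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


# Hypothetical values (assumptions for illustration):
# x = group membership, y = rate of growth, z = socioeconomic status.
raw_r = 0.30  # x and y look related when z is ignored
controlled_r = partial_correlation(r_xy=0.30, r_xz=0.60, r_yz=0.50)

print(f"raw r = {raw_r:.2f}, r after accounting for z = {controlled_r:.2f}")
```

With these made-up inputs, everything x and y have in common runs through z, so the relationship drops to zero once z is taken into account.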

Summing up

Most discussions of practical significance stop with effect size. From a statistical perspective, this makes sense. Statistics can’t be used to determine which outcomes matter. People have to do that part, but statistics, when good ones are available, should come first. Here’s my recipe:

  1. Find out if the relationship is real (p < .05).
  2. Find out if it is really real (replication).
  3. Consider the effect size.
  4. Decide how much it matters.
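The recipe can be sketched as a small checklist function. This is only an illustration: the function name and its arguments are hypothetical, the effect size is assumed to be a correlation (r), and the last step stays a human judgment that is simply passed in as True or False.

```python
def weigh_finding(p_value: float, replicated: bool, r: float, matters_to_me: bool) -> str:
    """Walk through the four questions in order and say where a finding stands."""
    # Step 1: is the relationship real?
    if p_value >= 0.05:
        return "not statistically significant"
    # Step 2: is it really real?
    if not replicated:
        return "statistically significant, but not yet replicated"
    # Step 3: how big is it? (share of variance explained, r squared)
    size = f"explains {r ** 2:.0%} of the variance"
    # Step 4: does it matter? Only the decision-maker can answer this.
    if not matters_to_me:
        return f"real, {size}, but not important for this decision"
    return f"real, {size}, and important for this decision"


# Hypothetical inputs for illustration.
print(weigh_finding(p_value=0.01, replicated=True, r=0.50, matters_to_me=True))
```

The order matters: the statistics come first, and the judgment about importance comes last.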

My organization, Lectica, Inc., is a 501(c)3 nonprofit corporation. Part of our mission is to share what we learn with the world. One of the things we’ve learned is that many assessment buyers don’t seem to know enough about statistics to make the best choices. The Statistics for all series is designed to provide assessment buyers with the knowledge they need most to become better assessment shoppers.

 


Statistics for all: Prediction

Why you might want to reconsider using 360s and EQ assessments to predict recruitment success


Measurements are often used to make predictions. For example, they can help predict how tall a 4-year-old is likely to be in adulthood, which students are likely to do better in an academic program, or which candidates are most likely to succeed in a particular job.

Some of the attributes we measure are strong predictors, others are weaker. For example, a child’s height at age 4 is a pretty strong predictor of adult height. Parental height is a weaker predictor. The complexity of a person’s workplace decision making, on its own, is a moderate predictor of success in the workplace. But the relation between the complexity of their workplace decision making and the complexity of their role is a strong predictor.

How do we determine the strength of a predictor? In statistics, the strength of predictions is represented by an effect size. Most effect size indicators are expressed as decimals and range from .00 to 1.00, with 1.00 representing 100% accuracy of prediction. The effect size indicator you’ll see most often is r-square. If you’ve ever been forced to take a statistics course ;) you may remember that r represents the strength of a correlation. Before I explain r-square, let’s look at some correlation data.

The four figures below represent 4 different correlations, from weakest (.30) to strongest (.90). Let’s say the vertical axis (40 –140) represents the level of success in college, and the horizontal axis (50 –150) represents scores on one of 4 college entrance exams. The dots represent students. If you were trying to predict success in college, you would be wise to choose the college entrance exam that delivered an r of .90.

Why is an r of .90 preferable? Well, take a look at the next set of figures. I’ve drawn lines through the clouds of dots (students) to show regression lines. These lines represent the prediction we would make about how successful a student will be, given a particular score. It’s clear that in the case of the first figure (r =.30), this prediction is likely to be pretty inaccurate. Many students perform better or worse than predicted by the regression line. But as the correlations increase in size, prediction improves. In the case of the fourth figure (r =.90), the prediction is most accurate.
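To get a feel for how much better prediction gets as r grows, here is a rough simulation. It is a sketch under stated assumptions, not the data behind the figures: the students are randomly generated, scores and success are put on a standard scale (mean 0, standard deviation 1), and the function name is hypothetical.

```python
import math
import random


def typical_prediction_miss(r: float, n: int = 5000, seed: int = 1) -> float:
    """Simulate n students whose test scores and college success correlate
    at roughly r, fit a regression line, and return the typical miss
    (root-mean-square error) in standard deviation units."""
    rng = random.Random(seed)
    scores = [rng.gauss(0, 1) for _ in range(n)]
    # Success = the part tied to the test score plus independent noise.
    success = [r * x + math.sqrt(1 - r ** 2) * rng.gauss(0, 1) for x in scores]

    # Ordinary least-squares regression line.
    mean_x = sum(scores) / n
    mean_y = sum(success) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, success))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    squared_misses = [(y - (intercept + slope * x)) ** 2
                      for x, y in zip(scores, success)]
    return math.sqrt(sum(squared_misses) / n)


for r in (0.30, 0.90):
    print(f"r = {r:.2f}: typical miss = {typical_prediction_miss(r):.2f} SD")
```

The miss for r = .30 comes out at roughly double the miss for r = .90, which is the difference between the loose cloud of dots in the first figure and the tight one in the fourth.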

What does a .90 correlation mean in practical terms? That’s where r-square comes in. If we multiply .90 by .90 (calculate the square), we get an r-square of .81. Statisticians would say that the predictor (test score), explains 81% of the variance in college success. The 19% of the variance that’s not explained (1.00 -.81 =.19) represents the percent of the variance that is due to error (unexplained variance). The square root of 19% is the amount of error (.44).
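The arithmetic in the paragraph above can be checked in a few lines. The function names are made up for this sketch.

```python
import math


def variance_explained(r: float) -> float:
    """r-square: the share of variance the predictor accounts for."""
    return r ** 2


def variance_unexplained(r: float) -> float:
    """The share of variance left over (error)."""
    return 1 - r ** 2


r = 0.90
print(f"r-square:                   {variance_explained(r):.2f}")               # 0.81
print(f"unexplained variance:       {variance_unexplained(r):.2f}")             # 0.19
print(f"square root of unexplained: {math.sqrt(variance_unexplained(r)):.2f}")  # 0.44
```

Swapping in a smaller r shows how quickly the explained share falls: an r of .30 gives an r-square of only .09.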

Even when r = .90, error accounts for 19% of the variance.

Correlations of .90 are very rare in the social sciences—but even correlations this strong are associated with a significant amount of error. It’s important to keep error in mind when we use tests to make big decisions—like who gets hired or who gets to go to college. When we use tests to make decisions like these, the business or school is likely to benefit—slightly better prediction can result in much better returns. But there are always rejected individuals who would have performed well, and there are always accepted individuals who will perform badly.

For references, see: The complexity of national leaders’ thinking: How does it measure up?

Let’s get realistic. As I mentioned earlier, correlations of .90 are very rare. In recruitment contexts, the most predictive assessments (shown above) correlate with hire success in the range of .50 to .54, explaining 25% to 29% of the variance in hire success. That leaves a whopping 71% to 75% of the variance unexplained, which is why the best hiring processes not only use the most predictive assessments, but also consider multiple predictive criteria.

On the other end of the spectrum, there are several common forms of assessment that explain less than 9% of the variance in recruitment success. Their correlations with recruitment success are lower than .30. Yet some of these, like 360s, reference checks, and EQ, are wildly popular. In the context of hiring, the amount of variance left unexplained in these cases (more than 91%) means there is a very big risk of being unfair to a large percentage of candidates. (I’m pretty certain assessment buyers aren’t intentionally being unfair. They probably just don’t know about effect size.)

If you’ve read my earlier article about replication, you know that the power-posing research could not be replicated. You also might be interested to learn that the correlations reported in the original research were also lower than .30. If power-posing had turned out to be a proven predictor of presentation quality, the question I’d be asking myself is, “How much effort am I willing to put into power-posing when the variance explained is lower than 9%?”

If we were talking about something other than power-posing, like reducing even a small risk that my child would die of a contagious disease, I probably wouldn’t hesitate to make a big effort. But I’m not so sure about power-posing before a presentation. Practicing my presentation or getting feedback might be a better use of my time.

Summing up (for now)

A basic understanding of prediction is worth cultivating. And it’s pretty simple. You don’t even have to do any fancy calculations. Most importantly, it can save you time and tons of wasted effort by giving you a quick way to estimate the likelihood that an activity is worth doing (or a product is worth having). Heck, it can even increase fairness. What’s not to like?



Statistics for all: Replication

Statistics for all: What the heck is confidence?

Statistics for all: Estimating confidence

 


Statistics for all: Replication

(Why you should have been suspicious of power-posing from the start!)

I’ve got a free, low-tech life hack for you that will save significant time and money — and maybe even improve your health. All you need to do is one little thing. Before you let the latest research results change your behavior, check to see if the research has been replicated!

One of the hallmarks of modern science is the notion that one study of a new phenomenon—especially a single small study—proves nothing. Most of the time, the results of such studies can do little more than suggest possibilities. To arrive at proof, results have to be replicated—again and again, usually in a variety of contexts. This is important, especially in the social sciences, where phenomena are difficult to measure and the results of many new studies cannot be replicated.

Researchers used to be trained to avoid even implying that findings from a new study were proven facts. But when Amy Cuddy set out to share the results of her and her colleagues’ power-posing research, she didn’t simply imply that her results could be generalized. She unabashedly announced to an enthralled TED Talk audience that she’d discovered a “Free, no-tech life hack…that could significantly change how your life unfolds.”

Thanks to this talk, many thousands—perhaps millions—of people-hours have been spent power-posing. But it’s not the power-posers whose lives have changed. Unfortunately, as it turns out, it’s Dr. Cuddy’s life that changed significantly—when other researchers were unable to replicate her results. In fact, because she had made such strong unwarranted claims, Dr. Cuddy became the focus of severe criticism.

Although she was singled out, Dr. Cuddy is far from alone. She’s got lots of company. Many fads have begun just like power-posing did. Here’s how it goes: A single small study produces results that have “novelty appeal,” the Today Show picks up the story, and thousands jump on the bandwagon! Sometimes, as in the case of power-posing, the negative impact is no worse than a bit of wasted time. But in other cases, such as when our health or pocketbooks are at stake, the impacts can be much greater.

“But it worked for me!” If you tried power-posing and believe it was responsible for your success in achieving an important goal, you may be right. The scientific method isn’t perfect — especially in the social sciences — and future studies with better designs may support your belief. However, I recommend caution in relying on personal experience. Humans have powerful built-in mental biases that lead us to conclude that positive outcomes are caused by something we did to induce them. This makes it very difficult for us to distinguish between coincidence and cause. And it’s one reason we need the scientific method, which is designed to help us reduce the impact of these biases.

Replication matters in assessment development, too

Over the last couple of decades, I’ve looked at the reliability & validity evidence for many assessments. The best assessment developers set a pretty high replication standard, conducting several validity & reliability studies for each assessment they offer. But many assessment providers—especially those serving businesses—are much more lax. In fact, many can point to only a single study of reliability and validity. To make matters worse, in some cases, that study has not been peer reviewed.

Be wary of assessments that aren’t backed by several studies of reliability and validity.



Statistics for all: Estimating confidence

In the first post in this series, I promised to share a quick and dirty trick for determining how much confidence you can have in a test score. I will. But first, I want to show you a bit more about what estimating confidence means when it comes to educational and psychological tests.

Let’s start with a look at how test scores are usually reported. The figure below shows three scores, one at level 8, one at level 6, and one at level 4. Looking at this figure, most of us would be inclined to assume that these scores are what they seem to be—precise indicators of the level of a trait or skill.

How test scores are usually presented

But this is not the case. Test scores are fuzzy. They’re best understood as ranges rather than as points on a ruler. In other words, test scores are always surrounded by confidence intervals. A person’s true score is likely to fall somewhere in the range described by the confidence interval around a test score.

In order to figure out how fuzzy a test score actually is, you need one thing—an indicator of statistical reliability. Most of the time, this is something called Cronbach’s Alpha. All good test developers publish information about the statistical reliability of their measures, ideally in refereed academic journals with easy to find links on their web sites! If a test developer won’t provide you with information about Alpha (or its equivalent) for each score reported on a test, it’s best to move on.

The higher the reliability (usually Alpha) the smaller the confidence interval. And the smaller the confidence interval, the more confidence you can have in a test score.

The table below will help to clarify why it is important to know Alpha (or its equivalent). It shows the relationship between Alpha (which can range from 0 to 1.0) and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.

Reliability    Strata
.70            2
.80            3
.90            4
.94            5
.95            6
.96            7
.97            8
.98            9

Strata have direct implications for the confidence we can have in a person’s score on a given assessment, because they tell us about the range within which a person’s true score would fall—its confidence interval—given the score awarded.

Imagine that you have just taken a test of emotional intelligence with a score range of 1 to 10 and a reliability of .95. The number of strata into which an assessment with a reliability of .95 can be divided is about 6, which means that each stratum equals about 1.7 points on the 10 point scale (10 divided by 6). If your score on this test was 8, your true score would likely be somewhere between about 7.2 and 8.8, which is your score’s confidence interval.

The figure below shows the true score ranges for three test takers, CB, RM, and PR. The fact that these ranges don’t overlap gives us confidence that the emotional intelligence of these test-takers is actually different**.

If these scores were closer together, their confidence intervals would overlap. And if that were the case (for example, if you were comparing two individuals with scores of 8 and 8.5), it would not be correct to say the scores were different from one another. In fact, it would be incorrect for a hiring manager to consider the difference between a score of 8 and a score of 8.5 in making a choice between two job candidates.

By the way, tests with Alphas in the range of .94 or higher are considered suitable for high-stakes use (assuming that they meet other essential validity requirements). What you see in the figure below is about as good as it gets in educational and psychological assessment.

estimating confidence when alpha is .95

Most assessments used in organizations do not have Alphas that are anywhere near .95. Some of the better assessments have Alphas as high as .85. Let’s take a look at what an Alpha at this level does to confidence intervals.

If the test you have taken has a score range of 1–10 and an Alpha (reliability) of .85, the number of strata into which this assessment can be divided is about 3.4, which means that each stratum equals about 2.9 (10 divided by 3.4) points on the 10 point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6.6 to 9.5*.
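The quick and dirty calculation above can be written as a short function. This is a sketch of the rule of thumb used in this article, not a formal statistical procedure. The function name is made up, and trimming the range so it stays on the 1–10 scale is an added assumption.

```python
def true_score_range(score, strata, scale_points=10, lowest=1, highest=10):
    """Return (low, high): the score plus or minus half of one stratum.

    Assumption: the range is trimmed so it never leaves the scoring scale.
    """
    stratum_width = scale_points / strata
    low = max(lowest, score - stratum_width / 2)
    high = min(highest, score + stratum_width / 2)
    return round(low, 2), round(high, 2)


# Alpha of .85, about 3.4 strata, score of 8.
print(true_score_range(8, strata=3.4))  # (6.53, 9.47)
```

The result is close to the 6.6 to 9.5 given above. The small difference comes from rounding the width of a stratum to 2.9 before halving it.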

In the figure below, note that CB’s true score range now overlaps RM’s true score range and RM’s true score range overlaps PR’s true score range. This means we cannot say—with confidence—that CB’s score is different from RM’s score, or that RM’s score is different from PR’s score.

Assessments with Alphas in the .85 range are suitable for classroom use or low-stakes contexts. Yet, every day, schools and businesses use tests with reliabilities in the .85 range to make high stakes decisions—such as who will be selected for advancement or promotion. And this is often done in a way that would exclude RM (yellow circle) even though his confidence interval overlaps CB’s (teal circle) confidence interval.

estimating confidence when alpha is .85

Many tests used in organizations have Alphas in the .75 range. If the test you have taken has a score range of 1–10 and an Alpha of .75, the number of strata into which this assessment can be divided is about 2.2, which means that each stratum equals about 4.5 points on the 10 point scale. In this case, if you receive a score of 8, your true score is likely to fall within the range of 6–10*.

As shown in the figure below, scores would now have to differ by at least 4.5 points in order for us to distinguish between two people. CB’s and PR’s scores are different, but RM’s score is uninterpretable.
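The same rule of thumb gives a quick test of whether two scores can be told apart: they need to differ by at least the width of one stratum. The function name and the example scores are hypothetical, and this simple version ignores the narrowing and widening of ranges noted in the footnote.

```python
def can_distinguish(score_a, score_b, strata, scale_points=10):
    """True when two scores differ by at least one stratum width."""
    stratum_width = scale_points / strata
    return abs(score_a - score_b) >= stratum_width


# Alpha of .95 (about 6 strata): 8 and 8.5 are too close to call.
print(can_distinguish(8, 8.5, strata=6))    # False
# Alpha of .75 (about 2.2 strata): a 4 point gap is not enough, a 5 point gap is.
print(can_distinguish(8, 4, strata=2.2))    # False
print(can_distinguish(8, 3, strata=2.2))    # True
```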

Tests or sub-scales with Alphas in the .75 range are considered suitable for research purposes. Yet, sad to say, schools and businesses now use tests with scales or sub-scales that have Alphas in or below the .75 range, treating these scores as if they provide useful information, when in most cases the scores, like RM’s, are uninterpretable.

estimating confidence when alpha is .75

If your current test providers are not reporting true score ranges (confidence intervals), ask for them. If they only provide Alphas (reliability statistics) you can use the table and figures in this article to calculate true score ranges for yourself. If you don’t want to do the math, no problem. You can use the figures above to get a feel for how precise a score is.

Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important. I’ll be sharing some tricks for looking at these forms of validity in future articles.

Related Articles

Statistics for all: What the heck is confidence?


*This range will be wider at the top and bottom of the scoring range and a bit narrower in the middle of the range.

**It doesn’t tell us if emotional intelligence is important. That is determined in other ways.


References

Guilford J. P. (1965). Fundamental statistics in psychology and education. 4th Edn. New York: McGraw-Hill.

Kubiszyn T., Borich G. (1993). Educational testing and measurement. New York: Harper Collins.

Wright B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.

 


Statistics for all: What the heck is confidence?

Confidence in testing

I doubt there is a person in the Western world over the age of 4 who hasn’t taken a psychological or educational test. Yet very few of us know one of the most important facts about these tests—their scores are always imprecise.

When you measure the height of a child, you can be pretty confident that the measurement you make is correct within a fraction of an inch on either side. And if you check the time on your mobile phone, you can be pretty certain that it is accurate within a fraction of a minute on either side. Rulers and clocks are well-calibrated measures that we can use with great confidence if we use them correctly. The same is true of measures of temperature, speed, frequency, and weight.

But even measurements made with these metrics are more or less precise. They’re correct within a range. These ranges are called confidence intervals. The confidence interval around the measurement of a child’s height would be expressed as something like “82 centimeters plus or minus 1/2 of a centimeter.” Statisticians would say that the child’s true height is likely to be somewhere in this range.

Scores on educational and psychological tests have confidence intervals too. But there is a difference between these confidence intervals and those for physical measurements. The confidence intervals around scores on psychological and educational tests are larger than the confidence intervals around measurements in the physical world. How much larger? Let’s look at an example.

The psychological and educational tests with the smallest confidence intervals are those made by high-stakes test developers like ETS. For their high stakes tests (the ones used to make decisions like who gets to go to which college) they set the highest standard. This standard, if it were applied to measuring height, would allow us to say something along the lines of, “We’re confident that this child is 82 centimeters tall, give or take 8 centimeters.”

Now, you may argue that 8 centimeters isn’t all that much, but if you’re buying a car seat or deciding who gets to ride a roller coaster, it could be the difference between life and death. Measurement precision matters.

The more imprecise our measurements are — the bigger the confidence intervals around them — the more careful we need to be about the kinds of decisions we make with them. When it comes to educational and psychological assessment, I think we’re far too careless. Too many people who buy and use assessments don’t know enough about statistics to make well-informed assessment decisions.

Fortunately, I believe we can remedy this! And it seems to me that the best place to begin is with confidence, so, in the next article in this series I’m going to share a super-easy way to figure out how much confidence you can have in any test’s scores.

 
