There’s a battle out there no one’s tweeting about. It involves a tension between statistical significance and practical significance. If you make decisions that involve evaluating evidence—in other words, if you are human—understanding the distinction between these two types of significance will significantly improve your decisions (both practically and statistically).
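Statistical significance is usually reported as a p value. A p value is the result of a calculation that tells us how likely we would be to see a relationship at least this strong between two factors (variables) if there were actually no relationship at all. The lower the p value, the more confident we can be that the relationship is real. Most of the time, we want p to be less than .05.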
Don’t be misled! A low p value tells us nothing about the size of a relationship between two variables. When someone says that statistical significance is high, all this means is that we can be more confident that the relationship is real.
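To see why a low p value says nothing about size, here is a minimal sketch in Python. It assumes the relationship is measured as a correlation (r) and uses a rough large-sample approximation for p. The function name `approx_p_value` and the sample sizes are made up for illustration and do not come from any study.

```python
import math

def approx_p_value(r, n):
    """Rough two-sided p value for a correlation r observed in n cases.

    Assumption: this uses a normal approximation, which is only
    reasonable when n is fairly large (say, 100 or more).
    """
    # Convert the correlation to a test statistic that grows with n.
    t = r * math.sqrt((n - 2) / (1 - r * r))
    # Two-sided tail probability under the standard normal curve.
    return math.erfc(abs(t) / math.sqrt(2))

# The same tiny relationship (r = 0.05) at two sample sizes.
small_study = approx_p_value(0.05, 100)
large_study = approx_p_value(0.05, 10_000)

print(f"n = 100:    p = {small_study:.3f}")
print(f"n = 10,000: p = {large_study:.7f}")
```

The relationship is equally small in both cases. Only the amount of data changed. With enough data, even a very small relationship will pass the p < .05 test.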
Once we know we can be confident that a relationship between two variables is real, we should check to see if the research has been replicated. That’s because we can’t be sure a statistically significant relationship found in a single study is really real. After we’ve determined that a relationship is statistically significant and replicable, it’s time to consider practical significance. Practical significance has to do with the size of the relationship.
To figure out how practically significant a relationship is, we need to know how big it is. The size of a relationship, or effect size, is evaluated independently of p. For a plain English discussion of effect size, check out this article, Statistics for all: prediction.
The greater the size of a relationship between two variables, the more likely the relationship is to be important — but that’s not enough. To have real importance, a relationship must also matter. And it is the decision-maker who decides what matters.
Let’s look at one of my favorite examples. The results of high-stakes tests like the SAT and GRE (admissions exams for college and graduate school, respectively) have been shown to predict college success. Effect sizes tend to be small, but the effects are statistically significant, so we can have confidence that they are real. And evidence for these effects has come from numerous studies, so we know they are really real.
If you’re the president of a college, there is little doubt that these test scores have practical significance. Improving prediction of student success, even a little, can have a big impact on the bottom line.
If you’re an employer, you’re more likely to care about how well a student did in college than how they did prior to college, so SAT and GRE scores are likely to be less important to you than college success.
If you’re a student, the size of the effect isn’t important at all. You don’t make the decision about whether or not the school is going to use the SAT or GRE to filter students. Whether or not these assessments are used is out of your control. What’s important to you is how a given college is likely to benefit you.
If you’re me, the size of the effect isn’t very important either. My perspective is that of someone who wants to see major changes in the educational system. I don’t think we’re doing our students any favors by focusing on the kind of learning that can be measured by tests like the GRE and SAT. I think our entire educational system leans toward the wrong goal—transmitting more and more “correct” information. I think we need to ask if what students are learning in school is preparing them for life.
Another thing to consider when evaluating practical significance is whether or not a relationship between two variables tells us only part of a more complex story. For example, the relationship between ethnicity and the rate of developmental growth (what my colleagues and I specialize in measuring) is highly statistically significant (real) and fairly strong (moderate effect size). But this relationship completely disappears once socioeconomic status (wealth) is taken into account. The first relationship is misleading (spurious). The real culprit is poverty. It’s a social problem, not an ethnic problem.
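The way a third factor can account for a relationship can be sketched with made-up data. In the Python example below, a hidden `background` factor drives two other variables, `a` and `b`, which have no direct connection to each other. All names and numbers are hypothetical. This is synthetic data, not the developmental data described above, and partial correlation is just one common way to take a factor into account.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_r(xs, ys, zs):
    """Correlation between xs and ys after removing what zs explains."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic data: both a and b are "background plus noise".
rng = random.Random(42)  # fixed seed so the run is repeatable
background = [rng.gauss(0, 1) for _ in range(5000)]
a = [z + rng.gauss(0, 1) for z in background]
b = [z + rng.gauss(0, 1) for z in background]

raw = pearson_r(a, b)
adjusted = partial_r(a, b, background)
print(f"a vs. b, ignoring background: r = {raw:.2f}")
print(f"a vs. b, background taken into account: r = {adjusted:.2f}")
```

In this toy setup, `a` and `b` look clearly related until `background` is taken into account, at which point the relationship shrinks to roughly zero. That is the pattern to watch for before treating a relationship as the whole story.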
Most discussions of practical significance stop with effect size. From a statistical perspective, this makes sense. Statistics can’t be used to determine which outcomes matter. People have to do that part, but statistics, when good ones are available, should come first. Here’s my recipe:
- Find out if the relationship is real (p < .05).
- Find out if it is really real (replication).
- Consider the effect size.
- Decide how much it matters.
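For readers who think in code, the recipe can be sketched as a small Python checklist. Everything here is hypothetical: the `Finding` fields, the `review` function, the example numbers, and the thresholds each decision-maker uses are invented for illustration. The last step is deliberately passed in as a function, because deciding what matters belongs to the person, not the statistics.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    p_value: float       # statistical significance of the relationship
    replications: int    # independent studies that found it again
    effect_size: float   # size of the relationship (e.g., a correlation)

def review(finding, matters_to_me, alpha=0.05):
    """Walk through the four steps and report where the finding lands."""
    # Step 1: is the relationship real?
    if finding.p_value >= alpha:
        return "stop: not statistically significant"
    # Step 2: is it really real?
    if finding.replications < 1:
        return "wait: significant, but not yet replicated"
    # Steps 3 and 4: consider the effect size, then decide if it matters.
    if matters_to_me(finding.effect_size):
        return "act: real, replicated, and it matters to me"
    return "pass: real and replicated, but it does not matter enough to me"

# Two decision-makers looking at the same small, well-replicated effect.
finding = Finding(p_value=0.001, replications=12, effect_size=0.1)
college_president = review(finding, lambda size: size > 0.05)
employer = review(finding, lambda size: size > 0.3)

print("College president:", college_president)
print("Employer:", employer)
```

The same finding leads to different decisions because the two people bring different ideas of what matters, which is the point of the final step.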
My organization, Lectica, Inc., is a 501(c)(3) nonprofit corporation. Part of our mission is to share what we learn with the world. One of the things we’ve learned is that many assessment buyers don’t seem to know enough about statistics to make the best choices. The Statistics for all series is designed to provide assessment buyers with the knowledge they need most to become better assessment shoppers.
- Statistics for all: What the heck is confidence?
- Statistics for all: Estimating confidence
- Statistics for all: Replication
- Statistics for all: Prediction