Statistics for all: Prediction

Why you might want to reconsider using 360s and EQ assessments to predict recruitment success


Measurements are often used to make predictions. For example, they can help predict how tall a 4-year-old is likely to be in adulthood, which students are likely to do better in an academic program, or which candidates are most likely to succeed in a particular job.

Some of the attributes we measure are strong predictors, others are weaker. For example, a child’s height at age 4 is a pretty strong predictor of adult height. Parental height is a weaker predictor. The complexity of a person’s workplace decision making, on its own, is a moderate predictor of success in the workplace. But the relation between the complexity of their workplace decision making and the complexity of their role is a strong predictor.

How do we determine the strength of a predictor? In statistics, the strength of predictions is represented by an effect size. Most effect size indicators are expressed as decimals and range from .00 to 1.00, with 1.00 representing 100% accuracy of prediction. The effect size indicator you’ll see most often is r-square. If you’ve ever been forced to take a statistics course—;)—you may remember that r represents the strength of a correlation. Before I explain r-square, let’s look at some correlation data.

The four figures below represent four different correlations, from weakest (.30) to strongest (.90). Let’s say the vertical axis (40–140) represents the level of success in college, and the horizontal axis (50–150) represents scores on one of four college entrance exams. The dots represent students. If you were trying to predict success in college, you would be wise to choose the college entrance exam that delivered an r of .90.

Why is an r of .90 preferable? Well, take a look at the next set of figures. I’ve drawn lines through the clouds of dots (students) to show regression lines. These lines represent the prediction we would make about how successful a student will be, given a particular score. It’s clear that in the case of the first figure (r = .30), this prediction is likely to be pretty inaccurate. Many students perform better or worse than predicted by the regression line. But as the correlations increase in size, prediction improves. In the case of the fourth figure (r = .90), the prediction is most accurate.
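If you’d like to see this for yourself, here is a minimal simulation sketch in Python (assuming only the numpy library; the sample size, random seed, and variable names are illustrative and not taken from any of the studies discussed here). It generates exam and success scores at two correlation levels, fits a regression line to each, and reports how far students typically land from the line.

```python
# Minimal sketch: stronger correlations mean smaller prediction errors.
# All numbers here are simulated and illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # simulated students

for r in (0.30, 0.90):
    # Draw standardized (exam, success) pairs with correlation r.
    cov = [[1.0, r], [r, 1.0]]
    exam, success = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

    # Fit the regression line: predicted success = slope * exam + intercept.
    slope, intercept = np.polyfit(exam, success, deg=1)
    predicted = slope * exam + intercept

    # Typical distance between actual and predicted success (in SD units).
    rmse = np.sqrt(np.mean((success - predicted) ** 2))
    print(f"r = {r:.2f} -> typical prediction error ~ {rmse:.2f} SDs")
```

With these settings, the typical error comes out near 0.95 standard deviations when r = .30 and near 0.44 when r = .90, which is the square-root-of-unexplained-variance pattern explained next.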

What does a .90 correlation mean in practical terms? That’s where r-square comes in. If we multiply .90 by .90 (calculate the square), we get an r-square of .81. Statisticians would say that the predictor (test score) explains 81% of the variance in college success. The 19% of the variance that’s not explained (1.00 − .81 = .19) represents the percent of the variance that is due to error (unexplained variance). The square root of this unexplained variance (√.19 ≈ .44) is the amount of error.
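Here is the same arithmetic as a tiny sketch (my own illustration, not part of the original studies), which turns any correlation into the variance it explains and the error that remains:

```python
# Minimal sketch of the r-square arithmetic described above.
def prediction_summary(r: float) -> None:
    """Print the variance a correlation explains and the error that remains."""
    variance_explained = r ** 2             # r-square
    unexplained = 1.0 - variance_explained  # variance left to "error"
    error = unexplained ** 0.5              # square root of unexplained variance
    print(f"r = {r:.2f}: explains {variance_explained:.0%}, "
          f"leaves {unexplained:.0%} unexplained (error ~ {error:.2f})")

for r in (0.90, 0.54, 0.50, 0.30, 0.24):
    prediction_summary(r)
```

Running it for r = .90 reproduces the numbers above (81% explained, 19% unexplained, error of about .44); the smaller values preview the assessments discussed later in this article.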

Even when r = .90, error accounts for 19% of the variance.

Correlations of .90 are very rare in the social sciences—but even correlations this strong are associated with a significant amount of error. It’s important to keep error in mind when we use tests to make big decisions—like who gets hired or who gets to go to college. When we use tests to make decisions like these, the business or school is likely to benefit—slightly better prediction can result in much better returns. But there are always rejected individuals who would have performed well, and there are always accepted individuals who will perform badly.

For references, see: The complexity of national leaders’ thinking: How does it measure up?

Let’s get realistic. As I mentioned earlier, correlations of .90 are very rare. In recruitment contexts, the most predictive assessments (shown above) correlate with hire success in the range of .50 to .54, predicting from 25% to 29% of the variance in hire success. That leaves a whopping 71% to 75% of the variance unexplained, which is why the best hiring processes not only use the most predictive assessments, but also consider multiple predictive criteria.

On the other end of the spectrum, there are several common forms of assessment that explain less than 9% of the variance in recruitment success. Their correlations with recruitment success are lower than .30. Yet some of these, like 360s, reference checks, and EQ, are wildly popular. In the context of hiring, the amount of variance left to error in these cases (more than 91%) means there is a very big risk of being unfair to a large percentage of candidates. (I’m pretty certain assessment buyers aren’t intentionally being unfair. They probably just don’t know about effect size.)

If you’ve read my earlier article about replication, you know that the power-posing research could not be replicated. You also might be interested to learn that the correlations reported in the original research were also lower than .30. If power-posing had turned out to be a proven predictor of presentation quality, the question I’d be asking myself is, “How much effort am I willing to put into power-posing when the variance explained is lower than 9%?”

If we were talking about something other than power-posing, like reducing even a small risk that my child would die of a contagious disease, I probably wouldn’t hesitate to make a big effort. But I’m not so sure about power-posing before a presentation. Practicing my presentation or getting feedback might be a better use of my time.

Summing up (for now)

A basic understanding of prediction is worth cultivating. And it’s pretty simple. You don’t even have to do any fancy calculations. Most importantly, it can save you time and tons of wasted effort by giving you a quick way to estimate the likelihood that an activity is worth doing (or a product is worth having). Heck, it can even increase fairness. What’s not to like?


My organization, Lectica, Inc., is a 501(c)3 nonprofit corporation. Part of our mission is to share what we learn with the world. One of the things we’ve learned is that many assessment buyers don’t seem to know enough about statistics to make the best choices. The Statistics for all series is designed to provide assessment buyers with the knowledge they need most to become better assessment shoppers.

Statistics for all: Replication

Statistics for all: What the heck is confidence?

Statistics for all: Estimating confidence

 


Lectica’s Human Capital Value Chain—for organizations that are serious about human development

Lectica's tools and services have powerful applications for every process in the human capital value chain. I explain how in the following video.

For links to more information see the HCVC page on Lecticalive. For references that support claims made in the video, see the post—Introducing LecticaFirst.

 


Introducing LecticaFirst: Front-line to mid-level recruitment assessment—on demand


The world's best recruitment assessments—unlimited, auto-scored, affordable, relevant, and easy

Lectical Assessments have been used to support senior and executive recruitment for over 10 years, but the expense of human scoring has prohibited their use at scale. I'm DELIGHTED to report that this is no longer the case. Because of CLAS—our electronic developmental scoring system—this fall we plan to deliver customized assessments of workplace reasoning with real-time scoring. We're calling this service LecticaFirst.

LecticaFirst is a subscription service.* It allows you to administer as many LecticaFirst assessments as you'd like, any time you'd like. It's priced to make it possible for your organization to pre-screen every candidate (up through mid-level management) before you look at a single resume or call a single reference. And we've built in several upgrade options, so you can easily obtain additional information about the candidates that capture your interest.

learn more about LecticaFirst subscriptions


The current state of recruitment assessment

"Use of hiring methods with increased predictive validity leads to substantial increases in employee performance as measured in percentage increases in output, increased monetary value of output, and increased learning of job-related skills" (Hunter, Schmidt, & Judiesch, 1990).

Most conventional workplace assessments focus on one of two broad constructs—aptitude or personality. These assessments examine factors like literacy, numeracy, role-specific competencies, leadership traits, and cultural fit, and are generally delivered through interviews, multiple-choice tests, or Likert-style surveys. Emotional intelligence is also sometimes measured, but thus far it has not produced results that can compete with aptitude tests (Zeidner, Matthews, & Roberts, 2004).

Like Lectical Assessments, aptitude tests are tests of mental ability (or mental skill). High-quality tests of mental ability have the highest predictive validity for recruitment purposes, hands down. Hunter and Hunter (1984), in their systematic review of the literature, found an effective range of predictive validity for aptitude tests of .45 to .54. Translated, this means that about 20% to 29% of success on the job was predicted by mental ability. These numbers do not appear to have changed appreciably since Hunter and Hunter's 1984 review.

Personality tests come in a distant second. In their meta-analysis of the literature, Tett, Jackson, and Rothstein (1991) reported an overall relation between personality and job performance of .24 (with conscientiousness as the best predictor by a wide margin). Translated, this means that only about 6% of job performance is predicted by personality traits. These numbers do not appear to have been challenged in more recent research (Johnson, 2001).

Predictive validity of various types of assessments used in recruitment

The following table shows average predictive validities for various forms of assessment used in recruitment contexts. The column "variance explained" is an indicator of how much of a role a particular form of assessment plays in predicting performance—its predictive power. When deciding which assessments to use in recruitment, the goal is to achieve the greatest possible predictive power with the fewest assessments. That's why I've included the last column, "variance explained (with GMA)." It shows what happens to the variance explained when an assessment of General Mental Ability is combined with the form of assessment in a given row (a rough sketch of this kind of calculation appears after the table). The best combinations shown here are GMA and work sample tests, GMA and integrity, and GMA and conscientiousness.

| Form of assessment | Source | Predictive validity | Variance explained | Variance explained (with GMA) |
|---|---|---|---|---|
| Complexity of workplace reasoning | Dawson & Stein, 2004; Stein, Dawson, Van Rossum, Hill, & Rothaizer, 2003 | .53 | 28% | n/a |
| Aptitude (General Mental Ability, GMA) | Hunter, 1980; Schmidt & Hunter, 1998 | .51 | 26% | n/a |
| Work sample tests | Hunter & Hunter, 1984; Schmidt & Hunter, 1998 | .54 | 29% | 40% |
| Integrity | Ones, Viswesvaran, & Schmidt, 1993; Schmidt & Hunter, 1998 | .41 | 17% | 42% |
| Conscientiousness | Barrick & Mount, 1995; Schmidt & Hunter, 1998 | .31 | 10% | 36% |
| Employment interviews (structured) | McDaniel, Whetzel, Schmidt, & Maurer, 1994; Schmidt & Hunter, 1998 | .51 | 26% | 39% |
| Employment interviews (unstructured) | McDaniel, Whetzel, Schmidt, & Maurer, 1994; Schmidt & Hunter, 1998 | .38 | 14% | 30% |
| Job knowledge tests | Hunter & Hunter, 1984; Schmidt & Hunter, 1998 | .48 | 23% | 33% |
| Job tryout procedure | Hunter & Hunter, 1984; Schmidt & Hunter, 1998 | .44 | 19% | 33% |
| Peer ratings | Hunter & Hunter, 1984; Schmidt & Hunter, 1998 | .49 | 24% | 33% |
| Training & experience: behavioral consistency method | McDaniel, Schmidt, & Hunter, 1988a, 1988b; Schmidt & Hunter, 1998; Schmidt, Ones, & Hunter, 1992 | .45 | 20% | 33% |
| Reference checks | Hunter & Hunter, 1984; Schmidt & Hunter, 1998 | .26 | 7% | 32% |
| Job experience (years) | Hunter, 1980; McDaniel, Schmidt, & Hunter, 1988b; Schmidt & Hunter, 1998 | .18 | 3% | 29% |
| Biographical data measures (Supervisory Profile Record Biodata Scale) | Rothstein, Schmidt, Erwin, Owens, & Sparks, 1990; Schmidt & Hunter, 1998 | .35 | 12% | 27% |
| Assessment centers | Gaugler, Rosenthal, Thornton, & Bentson, 1987; Schmidt & Hunter, 1998; Becker, Höft, Holzenkamp, & Spinath, 2011 | .37 | 14% | 28% |
| EQ | Zeidner, Matthews, & Roberts, 2004 | .24 | 6% | n/a |
| 360 assessments | Beehr, Ivanitskaya, Hansen, Erofeev, & Gudanowski, 2001 | .24 | 6% | n/a |
| Training & experience: point method | McDaniel, Schmidt, & Hunter, 1988a; Schmidt & Hunter, 1998 | .11 | 1% | 27% |
| Years of education | Hunter & Hunter, 1984; Schmidt & Hunter, 1998 | .10 | 1% | 27% |
| Interests | Schmidt & Hunter, 1998 | .10 | 1% | 27% |

Note: Arthur, Day, McNelly, & Edens (2003) found a predictive validity of .45 for assessment centers that included mental skills assessments.
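For readers who want to see where the last column comes from, here is a rough sketch of the standard two-predictor formula for combined variance explained. The predictive validities are taken from the table above, but the correlation between GMA and work sample scores (r12 = .38) is an assumed, illustrative value, not a figure reported in the cited studies.

```python
# Rough sketch: combining two predictors of job performance.
def combined_variance_explained(r1: float, r2: float, r12: float) -> float:
    """Multiple R^2 for two standardized predictors of one criterion."""
    return (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)

gma = 0.51          # predictive validity of GMA (from the table above)
work_sample = 0.54  # predictive validity of work sample tests (table above)
r12 = 0.38          # assumed GMA/work-sample correlation -- illustrative only

combined = combined_variance_explained(gma, work_sample, r12)
print(f"GMA alone: {gma ** 2:.0%}")                   # ~26%
print(f"Work sample alone: {work_sample ** 2:.0%}")   # ~29%
print(f"Combined: {combined:.0%}")                    # ~40%
```

With these inputs, the combined variance explained lands near 40%, in line with the work sample row above; the point is simply that adding a second, partly independent predictor raises predictive power, but nowhere near 100%.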

The figure below shows the predictive power information from this table in graphical form. Assessments are color coded to indicate which are focused on mental (cognitive) skills, behavior (past or present), or personality traits. It is clear that tests of mental skills stand out as the best predictors.

Predictive power graph

Why use Lectical Assessments for recruitment?

Lectical Assessments are "next generation" assessments, made possible through a novel synthesis of developmental theory, primary research, and technology. Until now, multiple-choice-style aptitude tests have been the most affordable option for employers. But despite being more predictive than personality tests, aptitude tests still suffer from important limitations. Lectical Assessments address these limitations. For details, take a look at the side-by-side comparison of LecticaFirst tests with conventional tests, below.

| Dimension | LecticaFirst | Aptitude |
|---|---|---|
| Accuracy | Level of reliability (.95–.97) makes them accurate enough for high-stakes decision-making. (Interpreting reliability statistics) | Varies greatly. The best aptitude tests have levels of reliability in the .95 range. Many recruitment tests have much lower levels. |
| Time investment | Lectical Assessments are not timed. They usually take 45–60 minutes, depending on the individual test-taker. | Varies greatly. For acceptable accuracy, tests must have many items and may take hours to administer. |
| Objectivity | Scores are objective (computer scoring is blind to differences in sex, body weight, ethnicity, etc.). | Scores on multiple-choice tests are objective. Scores on interview-based tests are subject to several sources of bias. |
| Expense | Highly competitive subscription (from $6–$10 per existing employee annually). | Varies greatly. |
| Fit to role: complexity | Lectica employs sophisticated developmental tools and technologies to efficiently determine the relation between role requirements and the level of reasoning skill required to meet those requirements. | Lectica's approach is not directly comparable to other available approaches. |
| Fit to role: relevance | Lectical Assessments are readily customized to fit particular jobs, and are direct measures of what's most important—whether or not candidates' actual workplace reasoning skills are a good fit for a particular job. | Aptitude tests measure people's ability to select correct answers to abstract problems. It is hoped that these answers will predict how good a candidate's workplace reasoning skills are likely to be. |
| Predictive validity | In research so far: predicts advancement (R = .53**, R² = .28), National Leadership Study. | The aptitude (IQ) tests used in published research predict performance (R = .45 to .54, R² = .20 to .29). |
| Cheating | The written-response format makes cheating virtually impossible when assessments are taken under observation, and very difficult when taken without observation. | Cheating is relatively easy, and rates can be quite high. |
| Formative value | High. LecticaFirst assessments can be upgraded after hiring, then used to inform employee development plans. | None. Aptitude is a fixed attribute, so there is no room for growth. |
| Continuous improvement | Our assessments are developed with a 21st-century learning technology that allows us to continuously improve the predictive validity of LecticaFirst assessments. | Conventional aptitude tests are built with a 20th-century technology that does not easily lend itself to continuous improvement. |

* CLAS is not yet fully calibrated for scores above 11.5 on our scale. Scores at this level are more often seen in upper- and senior-level managers and executives. For this reason, we do not recommend using LecticaFirst for recruitment above mid-level management.

** The US Department of Labor’s highest category of validity, labeled “Very Beneficial,” requires regression coefficients of .35 or higher (R > .34).


References

Arthur, W., Day, E. A., McNelly, T. A., & Edens, P. S. (2003). A meta‐analysis of the criterion‐related validity of assessment center dimensions. Personnel Psychology, 56(1), 125-153.

Becker, N., Höft, S., Holzenkamp, M., & Spinath, F. M. (2011). The Predictive Validity of Assessment Centers in German-Speaking Regions. Journal of Personnel Psychology, 10(2), 61-69.

Beehr, T. A., Ivanitskaya, L., Hansen, C. P., Erofeev, D., & Gudanowski, D. M. (2001). Evaluation of 360 degree feedback ratings: relationships with each other and with performance and selection predictors. Journal of Organizational Behavior, 22(7), 775-788.

Dawson, T. L., & Stein, Z. (2004). National Leadership Study results. Prepared for the U.S. Intelligence Community.

Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72(3), 493-511.

Hunter, J. E., & Hunter, R. F. (1984). The validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.

Hunter, J. E., Schmidt, F. L., & Judiesch, M. K. (1990). Individual differences in output variability as a function of job complexity. Journal of Applied Psychology, 75, 28-42.

Johnson, J. (2001). Toward a better understanding of the relationship between personality and individual job performance. In M. R. Barrick (Ed.), Personality and work: Reconsidering the role of personality in organizations (pp. 83-120).

McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988a). A meta-analysis of the validity of training and experience ratings in personnel selection. Personnel Psychology, 41(2), 283-309.

McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988b). Job experience correlates of job performance. Journal of Applied Psychology, 73, 327-330.

McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). Validity of employment interviews. Journal of Applied Psychology, 79, 599-616.

Rothstein, H. R., Schmidt, F. L., Erwin, F. W., Owens, W. A., & Sparks, C. P. (1990). Biographical data in employment selection: Can validities be made generalizable? Journal of Applied Psychology, 75, 175-184.

Stein, Z., Dawson, T., Van Rossum, Z., Hill, S., & Rothaizer, S. (2013, July). Virtuous cycles of learning: using formative, embedded, and diagnostic developmental assessments in a large-scale leadership program. Proceedings from ITC, Berkeley, CA.

Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A meta-analytic review. Personnel Psychology, 44, 703-742.

Zeidner, M., Matthews, G., & Roberts, R. D. (2004). Emotional intelligence in the workplace: A critical review. Applied Psychology: An International Review, 53(3), 371-399.
