There is a great deal of confusion in the assessment community about the interpretation of test reliability. This confusion results in part from the different ways in which researchers and test developers approach the issue.
Researchers learn how to design research instruments, which they use to study population trends or compare groups. They evaluate the quality of these instruments with statistics. One of the statistics used is Cronbach's Alpha, an indicator of statistical reliability that ranges from 0 to 1. Researchers are taught that Alphas above .75 or so are acceptable for their instruments, because this level of reliability indicates that an instrument is measuring real differences between people.
Test developers use a special branch of statistics called psychometrics to build assessments. Assessments are designed to evaluate individuals. Like researchers, test developers are concerned about reliability, but for somewhat different reasons. From a psychometric point of view, it's not enough to know that an assessment measures real differences between people. Psychometricians need to be confident that the score awarded to an individual is a good estimate of that particular individual's true score. Because of this, most psychometricians set higher standards for reliability than those set by researchers.
The table below helps clarify why assessments need higher reliabilities than research instruments. It shows the relationship between statistical reliability and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.
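The strata count can be computed directly from a reliability coefficient. The sketch below uses Wright's (1996) separation formula; it reproduces the values quoted above for reliabilities of .80 and .95, though I'm assuming this particular formula is the one behind the table, since published strata tables vary slightly in their rounding.

```python
import math

def separation(reliability):
    """Separation index G: ratio of true-score spread to error spread."""
    return math.sqrt(reliability / (1 - reliability))

def strata(reliability):
    """Number of statistically distinct score levels: (4G + 1) / 3."""
    return (4 * separation(reliability) + 1) / 3

print(round(strata(0.80), 1))  # 3.0 strata
print(round(strata(0.95), 1))  # about 6 strata
```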
Strata have direct implications for the confidence we can have in a specific person's score on a given assessment, because they tell us about the range within which a person's true score would fall, given a particular score.
Imagine that you have just taken a test of emotional intelligence with a score range of 1 to 10 and a reliability of .95. The number of strata into which this assessment can be divided is about 6, which means that each stratum spans about 1.75 points on the 10-point scale. If your score on this test is 8, your true score is likely to fall within the range of 7.1 to 8.9*.
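The arithmetic behind this example can be sketched as follows. The figures quoted above involve some rounding (a stratum width of about 1.75 rather than exactly 10 divided by 6), so this sketch reproduces the idea approximately rather than exactly.

```python
scale_width = 10   # treating the 1-to-10 scale as 10 points wide, as the article does
n_strata = 6       # from a reliability of about .95

stratum_width = scale_width / n_strata   # about 1.7 points per stratum

# The true score is likely to fall within half a stratum of the observed score.
observed = 8
low = observed - stratum_width / 2
high = observed + stratum_width / 2
print(f"true score likely between {low:.1f} and {high:.1f}")  # 7.2 and 8.8
```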
The figure below shows the true score ranges for three test takers, CB, RM, and PR. The fact that these ranges don't overlap gives us confidence that the emotional intelligence of these test-takers is actually different**.
Tests with Alphas in the range of .94 or higher are considered suitable for high-stakes use (assuming that they meet other essential validity requirements).
Now suppose the test you have taken has a score range of 1 to 10 and an Alpha (reliability) of .85. The number of strata into which this assessment can be divided is about 3.4, which means that each stratum spans about 2.9 points on the 10-point scale. If your score is again 8, your true score is likely to fall within the range of 6.6 to 9.5*.
In the figure below, note that CB's true score range now overlaps RM's true score range and RM's true score range overlaps PR's true score range. This means we cannot say—with confidence—that CB's score is different from RM's score, or that RM's score is different from PR's score.
Assessments with Alphas in the .85 range are suitable for classroom use or other low-stakes contexts. Yet, every day, schools and businesses use tests with reliabilities in the .85 range to make high-stakes decisions, such as who will be selected for advancement or promotion.
Finally, suppose the test you have taken has a score range of 1 to 10 and an Alpha (reliability) of .75. The number of strata into which this assessment can be divided is about 2.2, which means that each stratum spans about 4.5 points on the 10-point scale. This means your true score is likely to fall within the range of 6 to 10*.
As shown in the figure below, scores would now have to differ by at least 4.5 points in order for us to distinguish between two people. CB's and PR's scores are different, but RM's score is uninterpretable.
Tests or subscales with Alphas in the .75 range are considered suitable only for research purposes. Yet, sad to say, schools and businesses now use tests with subscales that have Alphas in or below the .75 range, treating these scores as if they provide useful information, when in most cases the scores, like RM's, are uninterpretable.
If your current test provider is not reporting true score ranges, ask for them. If they only provide Alphas (reliability statistics), you can use the table and figures in this article to figure out true score ranges for yourself.
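As one way to do that calculation yourself, here is a small helper built on Wright's (1996) separation formula. It approximately reproduces the ranges used in the three examples above (the article's own figures are rounded); treat its output as a rough guide, not a substitute for ranges reported by the test developer.

```python
import math

def true_score_range(observed, alpha, scale_width=10):
    """Approximate true score range: observed score plus or minus half a stratum."""
    separation = math.sqrt(alpha / (1 - alpha))      # separation index G
    n_strata = (4 * separation + 1) / 3              # distinct score levels
    half_stratum = (scale_width / n_strata) / 2
    return observed - half_stratum, observed + half_stratum

# The three scenarios discussed above, all for an observed score of 8:
for alpha in (0.95, 0.85, 0.75):
    low, high = true_score_range(8, alpha)
    print(f"Alpha {alpha}: true score likely between {low:.1f} and {high:.1f}")
```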
Be particularly wary of test developers that claim to measure multiple dimensions with 10- to 15-minute tests. It is not possible to detect individual differences reliably under these conditions.
Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important.
*This range will be wider at the top and bottom of the scoring range and a bit narrower in the middle of the range.
**It doesn't tell us whether emotional intelligence is important. That is determined in other ways.
Guilford, J. P. (1965). Fundamental statistics in psychology and education (4th ed.). New York: McGraw-Hill.
Kubiszyn, T., & Borich, G. (1993). Educational testing and measurement. New York: Harper Collins.
Wright, B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.