Posts Tagged statistical reliability
Maintaining inter-rater agreement
Posted by Theo in Lectical Assessment System, cognitive development, measurement on April 24, 2010
How we maintain inter-rater agreement and ensure high reliability at DTS/DiscoTest
First, we design assessments with 5-7 essay questions, partly because this many questions are needed to reach the level of reliability that lets us distinguish 4 phases per lectical level. This corresponds to a corrected alpha of .95 or greater.
Second, we engage in continuous learning. Certified analysts and trainees attend mandatory weekly scoring meetings (called scoring circles) where they discuss scoring and review challenging cases.
Third, when we begin working with data from a new subject area, the scoring circle always examines a diverse sample of protocols before starting to score in earnest. Then, when we begin scoring a new assessment, two Certified Analysts score every performance until agreement rates are consistently at or above 85% within 1/4 of a level.
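The 85% criterion above can be sketched as a simple agreement-rate calculation. This is only an illustration, not DTS's actual tooling: the function name, the choice to express scores as whole phases (4 phases = 1 level, so 1/4 of a level is 1 phase), and the sample scores are all assumptions made for the example.

```python
def agreement_rate(first_scores, second_scores, tolerance_phases=1):
    """Share of performances on which two analysts agree within the tolerance.

    Scores are assumed to be expressed in whole phases, so a tolerance of
    1 phase corresponds to 1/4 of a level.
    """
    if len(first_scores) != len(second_scores) or not first_scores:
        raise ValueError("need two equally long, non-empty score lists")
    agreements = sum(
        abs(a - b) <= tolerance_phases
        for a, b in zip(first_scores, second_scores)
    )
    return agreements / len(first_scores)


# Hypothetical scores from two analysts on the same ten performances.
first = [40, 41, 43, 44, 42, 45, 40, 41, 43, 44]
second = [40, 42, 43, 46, 42, 45, 41, 41, 43, 44]

rate = agreement_rate(first, second)
meets_threshold = rate >= 0.85  # the 85% criterion described above
```

What counts as "consistently" at or above 85% is not defined in the post, so the sketch checks a single batch only.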
Fourth, we second-score a percentage of all performances, some selected at random and some selected because the first analyst lacks confidence in his or her score.
- 5%-10% of all assessments, selected at random, are second-scored by a blind analyst (a higher percentage on newer assessments or when the rate of inter-rater agreement is unacceptable).
- A second, blind scorer is required to score an assessment any time the first scorer’s confidence level is below the level we call “confident”.
When the scores of the first and second scorers are different by more than 1 phase, first and second scorers must reconcile through discussion. If they cannot reconcile, they must consult a third Certified Analyst.
Confidence levels
4 = very confident: exemplary, prototypical
3 = confident: no guesswork, not too much variation, no more than 2 responses where scorer wavers, no lack of coherence, no language problems, adequate explanation, no suspicion of plagiarism, not idiosyncratic
2 = less than confident: guesswork, too much variation, more than 2 responses where scorer wavers, lack of coherence, language problems, inadequate explanation, suspicion of plagiarism, idiosyncratic
1 = not confident at all: unscorable or almost unscorable, very idiosyncratic, very incoherent
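The second-scoring and reconciliation rules described in this post can be written down as two small checks. This is a minimal sketch under stated assumptions: the function names are hypothetical, scores are assumed to be expressed in whole phases, and the random 5%-10% selection is passed in as a flag rather than simulated.

```python
CONFIDENT = 3  # "confident" on the 1-4 confidence scale listed above


def needs_second_score(confidence, randomly_selected=False):
    """A second, blind score is needed when the performance was picked at
    random or the first scorer's confidence is below "confident"."""
    return randomly_selected or confidence < CONFIDENT


def needs_reconciliation(first_phase, second_phase):
    """Scorers must reconcile when their scores differ by more than 1 phase."""
    return abs(first_phase - second_phase) > 1


# Hypothetical cases.
low_confidence_case = needs_second_score(confidence=2)
random_check_case = needs_second_score(confidence=4, randomly_selected=True)
routine_case = needs_second_score(confidence=3)
must_discuss = needs_reconciliation(first_phase=40, second_phase=42)
```

The escalation to a third Certified Analyst happens only when discussion fails, so it is left out of the sketch.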
Reliability 2: How high should it be?
Posted by Theo in educational testing, standardized testing, testing in general on July 5, 2009
There is a great deal of confusion in the assessment community about the interpretation of statistical reliability. This confusion results in part from the different ways in which researchers and test developers approach the issue. Researchers learn how to design research instruments which they use to study population trends or compare groups. They evaluate the quality of their instruments with statistics. One of the statistics used is Cronbach’s Alpha, an indicator of statistical reliability that ranges from 0 to 1. Researchers are taught that Alphas above .77 or so are acceptable for their instruments, because this level of reliability ensures that their instrument is measuring real differences between people.
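For readers who want to see how Cronbach's Alpha is computed, here is a minimal sketch using the standard textbook formula: alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the sample data are made up for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's Alpha for a table of scores.

    `scores` holds one row per test-taker and one column per item.
    """
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least 2 test-takers and 2 items")
    k = len(scores[0])
    # Sample variance of each item (column) across test-takers.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of each test-taker's total score.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 test-takers, 3 items.
example = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 5, 4],
    [3, 3, 3],
]
alpha = cronbach_alpha(example)
```

With real data there would be many more test-takers; five rows are used here only to keep the example readable.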
Test developers use a special branch of statistics called psychometrics to build assessments. Assessments are designed to evaluate individuals. Like researchers, test developers are concerned about reliability, but for somewhat different reasons. From a psychometric point of view, it is not enough to know that an assessment measures real differences between people. Psychometricians need to be confident that the score awarded to an individual is a good estimate of that particular individual’s true score. Because of this, most psychometricians set higher standards for reliability than those set by researchers.
The table below will help to clarify why it is important for assessments to have higher reliabilities than research instruments. It shows the relationship between statistical reliability and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.
| Reliability | Strata |
|---|---|
| .70 | 2 |
| .80 | 3 |
| .90 | 4 |
| .94 | 5 |
| .96 | 7 |
| .97 | 8 |
| .98 | 9 |
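One common way to relate reliability to strata comes from the Rasch measurement literature: separation G = sqrt(R / (1 − R)) and strata H = (4G + 1) / 3. The sketch below applies that formula to the reliabilities in the table. Whether the table was produced in exactly this way is an assumption; the formula gives values close to the table's, but not identical in every row, so treat the output as a rough cross-check.

```python
import math


def separation(reliability):
    """Separation index G = sqrt(R / (1 - R)), for 0 <= R < 1."""
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(reliability / (1 - reliability))


def strata(reliability):
    """Number of statistically distinct strata, H = (4G + 1) / 3."""
    return (4 * separation(reliability) + 1) / 3


# Print the unrounded strata for the reliabilities listed in the table.
for r in (0.70, 0.80, 0.90, 0.94, 0.96, 0.97, 0.98):
    print(f"{r:.2f}  {strata(r):.2f}")
```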
Strata have direct implications for the confidence we can have in a specific person’s score on a given assessment, because they tell us something about the range within which a person’s true score would fall, given a particular score. Imagine that you have taken a test with a scoring range of 0 to 500 and a reliability of .94. The number of strata into which this assessment can be divided is 5, which means that each stratum equals about 100 points on the 500-point scale. If your score on this test is 350, your true score is likely to fall within the range of 300 to 400*.
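The arithmetic in this example can be sketched as follows. The function name is hypothetical, and two simplifications are assumed: every stratum has the same width, and the band is centered on the observed score and clipped at the ends of the scale.

```python
def score_band(score, scale_min, scale_max, n_strata):
    """Approximate range for the true score, assuming equal-width strata
    centered on the observed score and clipped to the scale."""
    width = (scale_max - scale_min) / n_strata
    low = max(scale_min, score - width / 2)
    high = min(scale_max, score + width / 2)
    return low, high


# The example from the text: a 0-500 scale, 5 strata, and a score of 350.
band = score_band(350, 0, 500, 5)
```

For the numbers in the text, each stratum is 500 / 5 = 100 points wide, so the band runs 50 points either side of 350.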
Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important.
*This range will be wider at the top and bottom of the scoring range and a bit narrower in the middle of the range.
References
Guilford J. P. (1965). Fundamental statistics in psychology and education. 4th Edn. New York: McGraw-Hill.
Kubiszyn T., Borich G. (1993). Educational testing and measurement. New York: Harper Collins.
Wright B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.
Predicting trends, testing people
Posted by Theo in educational testing, standardized testing, testing in general on April 15, 2009
Mark Forman, in his response to the post entitled “IQ and development,” wrote about the difference between predicting trends and testing individuals. I agree that people, including many academics, do not understand the difference between using assessments to predict trends and using assessments to make judgments about individuals. There are two main issues. First, as Mark argues, questions of validity differ, depending upon whether we are looking at individuals or population trends. If we are looking at trends, determining predictive validity is a simple matter of determining whether an assessment helps an institution make more successful decisions than it was able to make without the assessment. However, if a test is intended to be useful to individuals (aid in their learning, help them determine what to learn next, help them find the best place to learn, help them decide what profession to pursue, etc.), predictive validity cannot be determined by examining trends. In this case, the predictive validity of an assessment should be evaluated in terms of how well it predicts what individual test-takers can most benefit from learning next, where they can learn it, or what kind of employment they should seek, as individuals.
The second issue concerns reliability. Especially in the adult assessment field, researchers often do not understand that the levels of statistical reliability considered acceptable for studies of population trends are far from adequate for making judgments about individuals. Many of the adult assessments that are on the market today have been developed by researchers who do not understand the reliability criteria for assessments used to test individuals*. As a consequence, the reliability of these assessments is often so low that we cannot be confident that a score on a given assessment is truly different from any other score on that assessment.
*Unfortunately, there is no magic reliability number. But here are some general guidelines. The absolute minimum statistical reliability for an assessment that claims to distinguish two or three levels of performance is an alpha of .85. To claim up to 6 levels, you need an alpha of .95. You will also want to think about the meaning of these distinctions between levels in terms of confidence intervals. A confidence interval is the range in which an individual’s true score is most likely to fall. For example, in the case of Lectical™ assessments, the statistical reliabilities we have calculated over the last 10 years indicate that the confidence interval around Lectical scores is generally around 1/4 of a level (a phase).
Advice: If statistical reliability is not reported (preferably in a peer reviewed article), don’t use the test.
Reliability 1: Confidence in test scores
Posted by Theo in standardized testing, testing in general on March 21, 2009
How do you know if you can have confidence in the score you get on a test?
When you measure the height of a table, you can be pretty confident that the measurement you make is correct. Rulers are well-calibrated measures that we can use with great confidence if we use them correctly. Scores on tests are not like points on a ruler. They always have bands around them called confidence intervals. A confidence interval is the range around your score in which your “true ability” is likely to reside. Usually, a confidence interval represents a likelihood somewhere between 70% and 95%.
The overall level of confidence we can have in a test score is represented in the test’s statistical reliability. (A reliability of 1 is perfect.) As a general rule, no test that is used to evaluate individuals (as opposed to group trends) should have a statistical reliability below .85. Also, the higher the stakes, the higher the reliability should be. For example, the SAT and GRE have reliabilities in the .95 range.
There is a close relation between confidence intervals and reliability. If you place a series of 95% confidence intervals end-to-end along the scale of a really good standardized test—imagine putting pieces of string end to end along the length of a ruler—you won’t be able to fit more than 4 to 6 of them on the scale without allowing them to overlap. This means that the test can distinguish only 4 to 6 truly different levels of performance.
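The string-on-a-ruler picture can be made concrete with the standard error of measurement from classical test theory, SEM = SD × sqrt(1 − reliability), where a 95% confidence interval is the score plus or minus 1.96 × SEM. The sketch below is an illustration only: the function names are hypothetical, and the scale, standard deviation, and reliability are invented numbers, not figures for any real test.

```python
import math


def standard_error(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def confidence_interval(score, sd, reliability, z=1.96):
    """Confidence interval around an observed score (z=1.96 gives 95%)."""
    half_width = z * standard_error(sd, reliability)
    return score - half_width, score + half_width


def intervals_that_fit(scale_min, scale_max, sd, reliability, z=1.96):
    """How many confidence intervals fit end to end along the scale."""
    width = 2 * z * standard_error(sd, reliability)
    return (scale_max - scale_min) / width


# Invented test: scale 200-800, SD of 100, reliability of .90.
low, high = confidence_interval(500, sd=100, reliability=0.90)
n_levels = intervals_that_fit(200, 800, sd=100, reliability=0.90)
```

With these invented numbers, just under five intervals fit on the scale, which is in the 4 to 6 range described above. Different scales and standard deviations will give different counts.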
So, why do scores get reported on scales that span more than 4 to 6 levels?