Archive for category standardized testing
Teacher pay and standardized test results
Posted by Theo in educational testing, standardized testing, teaching on November 23, 2009
At the end of October, the Century Foundation released a paper entitled “Eight reasons not to tie teacher pay to standardized test results.” I agree with its conclusions, and would add that even if all standardized tests were extremely reliable and measured exactly what they were intended to measure, this would be a bad idea. This is because success in the adult world requires a multiplicity of skills and forms of knowledge, and tests focus on only some of these, one at a time. Until we can construct multifaceted longitudinal stories about the progress of individual students that are tied to a non-arbitrary standardized metric, we should not even consider linking student evaluations to teacher pay.
Reliability 2: How high should it be?
Posted by Theo in educational testing, standardized testing, testing in general on July 5, 2009
There is a great deal of confusion in the assessment community about the interpretation of statistical reliability. This confusion results in part from the different ways in which researchers and test developers approach the issue. Researchers learn how to design research instruments which they use to study population trends or compare groups. They evaluate the quality of their instruments with statistics. One of the statistics used is Cronbach’s Alpha, an indicator of statistical reliability that ranges from 0 to 1. Researchers are taught that Alphas above .77 or so are acceptable for their instruments, because this level of reliability ensures that their instrument is measuring real differences between people.
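For readers who have not seen how Cronbach’s Alpha is calculated, here is a minimal sketch using the standard textbook formula: the number of items k, divided by k - 1, times one minus the ratio of the summed item variances to the variance of the total scores. The function name and the tiny data set are hypothetical and are not taken from any real instrument.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's Alpha for a list of items.

    item_scores holds one list per item; each inner list contains the
    scores of the same respondents, in the same order.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("Alpha needs at least two items")
    # Total score for each respondent across all items.
    totals = [sum(person) for person in zip(*item_scores)]
    # Sum of the individual item variances (sample variance).
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical data: two items answered by four respondents.
example_items = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
]
print(round(cronbach_alpha(example_items), 2))
```

The sketch uses sample variances throughout; using population variances for both the items and the totals gives the same Alpha, because the correction factor cancels in the ratio.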
Test developers use a special branch of statistics called psychometrics to build assessments. Assessments are designed to evaluate individuals. Like researchers, test developers are concerned about reliability, but for somewhat different reasons. From a psychometric point of view, it is not enough to know that an assessment measures real differences between people. Psychometricians need to be confident that the score awarded to an individual is a good estimate of that particular individual’s true score. Because of this, most psychometricians set higher standards for reliability than those set by researchers.
The table below will help to clarify why it is important for assessments to have higher reliabilities than research instruments. It shows the relationship between statistical reliability and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.
| Reliability | Strata |
| .70 | 2 |
| .80 | 3 |
| .90 | 4 |
| .94 | 5 |
| .96 | 7 |
| .97 | 8 |
| .98 | 9 |
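The table appears to be consistent with the separation and strata formulas commonly used in Rasch measurement, so here is a hedged sketch of that arithmetic. It assumes that separation is G = sqrt(R / (1 - R)), that the number of strata is (4G + 1) / 3, and that the result is rounded down to a whole number. The function names are hypothetical, and the assumption that the table was built this way is mine.

```python
import math


def separation_from_reliability(reliability):
    """Separation index G = sqrt(R / (1 - R)) (assumed formula)."""
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be at least 0 and below 1")
    return math.sqrt(reliability / (1 - reliability))


def reliability_from_separation(separation):
    """Inverse of the formula above: R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


def strata_from_separation(separation):
    """Number of distinct strata, (4G + 1) / 3, rounded down (assumed)."""
    return math.floor((4 * separation + 1) / 3)


# Whole-number separations reproduce the rows from .80 upward.
for g in (2, 3, 4, 5, 6, 7):
    r = reliability_from_separation(g)
    print(f"separation={g}  reliability={r:.2f}  strata={strata_from_separation(g)}")
```

Under these assumptions the reliabilities in the table are rounded values, so feeding a rounded reliability such as .96 back into the formulas can land just below the next whole stratum. The .70 row corresponds to a separation of roughly 1.5.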
Strata have direct implications for the confidence we can have in a specific person’s score on a given assessment, because they tell us something about the range within which a person’s true score would fall, given a particular score. Imagine that you have taken a test with a scoring range of 0 to 500 and a reliability of .94. The number of strata into which this assessment can be divided is 5, which means that each stratum equals about 100 points on the 500-point scale. If your score on this test is 350, your true score is likely to fall within the range of 300 to 400*.
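The arithmetic in this example can be sketched in a few lines. This is a simplification that assumes equal-width strata and a band centered on the observed score; the function name is hypothetical.

```python
def score_band(score, scale_min, scale_max, strata):
    """Return the (low, high) band around a score, one stratum wide.

    Assumes every stratum has the same width and clips the band to the
    ends of the scale.
    """
    width = (scale_max - scale_min) / strata
    low = max(scale_min, score - width / 2)
    high = min(scale_max, score + width / 2)
    return low, high


# The example from the text: a 0 to 500 scale with 5 strata, score of 350.
print(score_band(350, 0, 500, 5))
```

Real bands are not perfectly uniform across the scale, so treat the result as a rough guide rather than a precise interval.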
Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important.
*This range will be wider at the top and bottom of the scoring range and a bit narrower in the middle of the range.
References
Guilford J. P. (1965). Fundamental statistics in psychology and education. 4th Edn. New York: McGraw-Hill.
Kubiszyn T., Borich G. (1993). Educational testing and measurement. New York: Harper Collins.
Wright B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.
Motivation & standardized testing
Posted by Theo in educational testing, motivation, standardized testing, teaching on April 25, 2009
Check out this post at Docere est Discere (Musings on language and teaching).
Predicting trends, testing people
Posted by Theo in educational testing, standardized testing, testing in general on April 15, 2009
Mark Forman, in his response to the post entitled “IQ and development,” wrote about the difference between predicting trends and testing individuals. I agree that people, including many academics, do not understand the difference between using assessments to predict trends and using assessments to make judgments about individuals. There are two main issues. First, as Mark argues, questions of validity differ, depending upon whether we are looking at individuals or population trends. If we are looking at trends, determining predictive validity is a simple matter of determining whether an assessment helps an institution make more successful decisions than it was able to make without the assessment. However, if a test is intended to be useful to individuals (aid in their learning, help them determine what to learn next, help them find the best place to learn, help them decide what profession to pursue, etc.), predictive validity cannot be determined by examining trends. In this case, the predictive validity of an assessment should be evaluated in terms of how well it predicts, for each test-taker as an individual, what they can most benefit from learning next, where they can learn it, or what kind of employment they should seek.
The second issue concerns reliability. Especially in the adult assessment field, researchers often do not understand that the levels of statistical reliability considered acceptable for studies of population trends are far from adequate for making judgments about individuals. Many of the adult assessments that are on the market today have been developed by researchers who do not understand the reliability criteria for assessments used to test individuals*. As a consequence, the reliability of these assessments is often so low that we cannot be confident that a score on a given assessment is truly different from any other score on that assessment.
*Unfortunately, there is no magic reliability number. But here are some general guidelines. The absolute minimum statistical reliability for an assessment that claims to distinguish two or three levels of performance is an alpha of .85. To claim up to 6 levels, you need an alpha of .95. You will also want to think about the meaning of these distinctions between levels in terms of confidence intervals. A confidence interval is the range in which an individual’s true score is most likely to fall. For example, in the case of Lectical™ assessments, the statistical reliabilities we have calculated over the last 10 years indicate that the confidence interval around Lectical scores is generally around 1/4 of a level (a phase).
Advice: If statistical reliability is not reported (preferably in a peer reviewed article), don’t use the test.
Testing the limits of testing
Posted by Theo in educational testing, standardized testing on April 6, 2009
The NTS is an interactive online survey that asks about (1) the legitimate purposes of testing and (2) how well today’s tests serve these purposes. In addition to completing a set of survey questions, respondents are offered an opportunity to write about their personal experiences with testing and share alternative testing resources. When respondents have completed the survey, they can view their results and compare them to national averages. Anyone who visits the site can read respondents’ stories, explore the resources, and track national results. Please participate in the NTS, and use your email lists and social networks to spread the word! Feel free to circulate the NTS poster or the poster announcing the NTS launch event. Contact Zachary Stein if you have questions or would like to become involved.
NTS launch event: Testing the limits of testing
Thursday, May 28th, 2009, 4:00 – 5:30 pm
Zachary Stein, Marc Schwartz, and Theo L. Dawson
The launch event will occur just prior to the opening of the second annual conference of the International Mind, Brain, and Education Society (IMBES) at the Sheraton Society Hill Hotel in Philadelphia, Pennsylvania. At this event, speakers will present preliminary data from the NTS, examine the limits of current test development methods, and explore new approaches to assessment, incorporating the perspectives of stakeholder groups who have participated in the survey so far.
More information is available on the NTS site.
Admission to the launch is FREE and open to the public, but space is limited. To attend, you must obtain a ticket from the NTS web site.
The conference will also feature a workshop on testing:
Educational testing for the 21st century: Challenges, models, and solutions
10:45 – 3:45, Saturday, May 30
Kurt Fischer, Marc Schwartz, Theo Dawson, Zachary Stein
The most basic form of educational testing takes the form of a “conversation” between an individual student and a teacher in which the student reveals what he or she is most likely to benefit from learning next. This kind of conversation increasingly takes a back seat to standardized forms of assessment that are designed to rank students for purposes that are dissociated from learning itself. Testing has lost its roots. The statistically generated rankings of standardized tests tell us very little about the specific learning needs of individual students. And it is becoming increasingly apparent that the kind of knowledge required to succeed on a typical standardized test bears little resemblance to the kind of knowledge required for adult life.
The challenge we now face is creating the kind of mass-customization that revives the educative role of assessments in the local dialogue between teachers, students, and the curriculum, while maintaining the advantages of standardization. Simply stated: we need tests that help teachers meet the learning needs of individual students, tests teachers ought to teach to. In this workshop, we explore perspectives on these issues from the classroom, cognitive developmental science, psychometrics, and philosophy and offer a concrete vision for the future of assessment.
The workshop is intended for educators, administrators, researchers, and policy makers. It is FREE to those who register for the entire IMBES conference. If you are interested in attending only the workshop, the fee is $80 before April 28th, and $95 after April 28th.
Test validity (part 1)
Posted by Theo in learning, standardized testing, testing in general on March 23, 2009
If a test (1) measures what it is intended to measure (construct validity) and (2) measures something of value (ecological validity), it is considered to be a valid test. Sounds pretty straightforward, but it’s not. That’s partly because these two categories of validity often compete with one another, and it is a challenge to find the right balance.
For example, it seems pretty obvious that math items should be about math and reading comprehension items should be about reading comprehension. So, to make sure a math test has construct validity—is about math—you ought to limit the amount of reading required to understand your test items, right?
But what if what you really want to know is how students tackle real-world math problems, which often require the ability to understand the context in which mathematical problems are encountered? After all, there are good reasons to think that a skill a student can apply in real-world contexts is superior to a skill a student can only exhibit on a test that is stripped of context. If you followed this line of reasoning and composed your test of questions that reflect how knowledge is used in the world outside of the classroom, it would have ecological validity.
However, while including context in your math test would increase its ecological validity, doing so would increase the risk of reducing its construct validity by making it less clear exactly what is being measured. This might be reflected in lowered scores for students who can do math but aren’t good readers or are unfamiliar with the kind of situations described in test questions. A result like this can look a lot like discrimination—especially when the stakes are high.
In sum, the more you strip away context, the more you risk lowering ecological validity. The more context you add, the more you risk lowering construct validity. Today, there is a strong tendency to prioritize construct validity over ecological validity, primarily because the stakes of many tests are very high, which increases our focus on anything that seems to interfere with fairness. Without intending to, test developers, policy-makers, parents, and teachers have contributed to the creation of tests with decreasing ecological validity—and there is no doubt that teachers are teaching to these tests. The implication? What students are learning in our public schools is increasingly irrelevant.
This is a cause for concern.
Reliability 1: Confidence in test scores
Posted by Theo in standardized testing, testing in general on March 21, 2009
How do you know if you can have confidence in the score you get on a test?
When you measure the height of a table, you can be pretty confident that the measurement you make is correct. Rulers are well-calibrated measures that we can use with great confidence if we use them correctly. Scores on tests are not like points on a ruler. They always have bands around them called confidence intervals. A confidence interval is the range around your score in which your “true ability” is likely to reside. Usually, a confidence interval represents a likelihood somewhere between 70% and 95%.
The overall level of confidence we can have in a test score is represented in the test’s statistical reliability. (A reliability of 1 is perfect.) As a general rule, no test that is used to evaluate individuals (as opposed to group trends) should have a statistical reliability below .85. Also, the higher the stakes, the higher the reliability should be. For example, the SAT and GRE have reliabilities in the .95 range.
There is a close relation between confidence intervals and reliability. If you place a series of 95% confidence intervals end-to-end along the scale of a really good standardized test—imagine putting pieces of string end to end along the length of a ruler—you won’t be able to fit more than 4 to 6 of them on the scale without allowing them to overlap. This means that the test can distinguish only 4 to 6 truly different levels of performance.
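The string-on-a-ruler picture can be sketched with the standard error of measurement from classical test theory, SEM = SD × sqrt(1 - reliability), and a 95% interval of plus or minus 1.96 SEM. The scale (200 to 800) and standard deviation (100) below are hypothetical values chosen only for illustration, as are the function names.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be at least 0 and below 1")
    return sd * math.sqrt(1 - reliability)


def nonoverlapping_intervals(scale_min, scale_max, sd, reliability, z=1.96):
    """How many full confidence intervals fit end-to-end on the scale."""
    sem = standard_error_of_measurement(sd, reliability)
    interval_width = 2 * z * sem  # a 95% interval when z = 1.96
    return math.floor((scale_max - scale_min) / interval_width)


# Hypothetical test: scores from 200 to 800 with a standard deviation of 100.
for reliability in (0.90, 0.95):
    count = nonoverlapping_intervals(200, 800, 100, reliability)
    print(f"reliability={reliability:.2f}  intervals that fit: {count}")
```

With these assumed values, a reliability of .90 leaves room for 4 intervals and a reliability of .95 leaves room for 6. The exact counts depend on the scale and standard deviation assumed.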
So, why do scores get reported on scales that span more than 4 to 6 levels?
Test accuracy
Posted by Theo in standardized testing, testing in general on March 21, 2009
How do you know if the score you get on a test is accurate?
This depends on what you mean by accurate. If you mean, “Can I be sure that the score I receive on a test is an accurate representation of my performance?” all you need to know is whether the test was scored accurately.
However, if you mean, “Can I be sure that the score I receive on a test is an accurate representation of my true abilities, competence, attitudes, dispositions, or opinions?” you can’t. It is impossible for a single test, no matter how good it is, to guarantee an accurate assessment of any of these things. In fact, it takes multiple assessments and multiple kinds of assessments to build up anything like an accurate picture.
I think what test developers do is a black box for most of us. We tend to assume that there is some kind of special insight test developers have by virtue of their psychometric tools (or knowledge about learning) that allows them to get inside of our minds and make accurate judgments about what’s going on in there. But there isn’t. All a single test can measure is performance on the items on that test. The rest is inference.
Realizing this, you may wonder why single scores on single tests are being used to make high stakes decisions about anything.
Making standardized tests
Posted by Theo in standardized testing on March 20, 2009
There are three main players in the creation of most standardized tests. They are the (1) discipline experts, (2) item developers, and (3) psychometricians. The discipline experts are usually PhDs who specialize in particular areas, like science, math, writing, or history. They know a lot about their content areas and have done research on teaching and learning in these areas. They also may be teachers of teachers.
A group of discipline experts works together to decide what material should be covered in lessons and on tests.* Discipline experts set standards through organizations like the National Research Council, and may or may not be affiliated with test developers.
The item developers create test questions. They usually have a bachelor’s or master’s degree in a particular subject area. Many have not taught, and few are experts in learning and development. Item developers design test questions that cover the content of the standards. Almost all of the items designed by item developers are multiple-choice, which means that they have right and wrong answers, and thus, must focus on “factual” knowledge.
The third players are psychometricians. They put together groups of items and examine how well these work together to measure students’ knowledge of the subject at hand. Generally, psychometricians know relatively little about learning and development and do not work closely with item developers or discipline experts.
Although discipline experts may include skills for thinking and learning in their standards, these skills are not measured on standardized tests, because they cannot be evaluated with multiple-choice items. And although discipline experts may focus on student understanding in their standards, research has shown that up to 50% of the students who get a multiple-choice item correct cannot demonstrate understanding by providing an adequate explanation of their answer.
Cognitive psychologists know about the problems that stem from how standardized tests are made. Many of them are not fans of standardized tests, partly because of the limitations of multiple choice items, partly because the tests are inadequately grounded in evidence about how students actually learn concepts and skills, and partly because the tests push teachers to emphasize breadth over depth and memorization of facts over skills for thinking and learning.
In future posts, after explaining a bit more about how tests work, I will examine an alternative testing model based on research into how students actually learn concepts and skills over time.