Posts Tagged psychometrics
What is a holistic assessment?
Posted by Theo in educational testing, measurement, research, testing in general on December 11, 2009
Thirty years ago, when I was a hippy midwife, the idea of holism began to slip into the counter-culture. A few years later, this much-misunderstood notion was all the rage on college campuses. By the time I was in graduate school in the nineties there was an impassable division between the trendy postmodern holists and the rigidly old-fashioned modernists. You may detect a slight mocking tone, and rightly so. People with good ideas on both sides made themselves look pretty silly by refusing, for example, to use any of the tools associated with the other side. One of the more tragic outcomes of this silliness was the emergence of the holistic assessment.
Simply put, the holistic assessment is a multidimensional assessment that is designed to take a more nuanced, textured, or rich approach to assessment. Great idea. Love it.
It’s the next part that’s silly. Having collected rich information on multiple dimensions, the test designers sum up a person’s performance with a single number. Why is this silly? Because the so-called holistic score becomes pretty much meaningless. Two people with the same score can have very little in common. For example, let’s imagine that a holistic assessment examines emotional maturity, perspective taking, and leadership thinking. Two people receive a score of 10 that may be accompanied by boilerplate descriptions of what emotional maturity, perspective taking, and leadership thinking look like at level 10. However, person one was actually weak in perspective taking and strongest in leadership thinking, and person two was weak in emotional maturity and strongest in perspective taking. The score of 10, it turns out, means something quite different for these two people. I would argue that it is relatively meaningless because there is no way to know, based on the single “holistic” score, how best to support the development of these distinct individuals.
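To make the point concrete, here is a minimal sketch of two profiles that collapse to the same total. The subscores are invented for illustration (the post gives no actual numbers), and treating the holistic score as a plain sum is an assumption.

```python
# Hypothetical subscores, invented for illustration only.
person_one = {"emotional_maturity": 3, "perspective_taking": 2, "leadership_thinking": 5}
person_two = {"emotional_maturity": 2, "perspective_taking": 5, "leadership_thinking": 3}


def holistic_score(profile):
    """Collapse a multidimensional profile into one number by summing (assumed rule)."""
    return sum(profile.values())


def weakest_and_strongest(profile):
    """Return the names of the lowest- and highest-scoring dimensions."""
    weakest = min(profile, key=profile.get)
    strongest = max(profile, key=profile.get)
    return weakest, strongest


for name, profile in [("person one", person_one), ("person two", person_two)]:
    # Same total for both people, but different weakest and strongest dimensions.
    print(name, holistic_score(profile), weakest_and_strongest(profile))
```

Both totals come out to 10, yet the weakest and strongest dimensions differ, which is exactly the information the single number throws away.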
Holism has its roots in system dynamics, where measurements are used to build rich models of systems. All of the measurements are unidimensional. They are never lumped together into “holistic” measures. That would be equivalent to talking about the temperaturelength of a day or the lengthweight of an object*. It’s essential to measure time, weight, and length with appropriate metrics and then to describe their interrelationships and the outcomes of these interrelationships. The language used to describe these is the language of probability, which is sensitive to differences in the measurement of different properties.
In psychological assessment, dimensionality is a challenging issue. What constitutes a single dimension is a matter for debate. For DTS, the primary consideration is how useful an assessment will be in helping people learn and grow. So, we tend to construct individual assessments, each of which represents a fairly tightly defined content space, and we use only one metric to determine the level of a performance. The meaning of a given score is both universal (it is an order of hierarchical complexity and phase on the skill scale) and contextual (it is provided to a performance in a particular domain in a particular context, and is associated with particular content). We independently analyze the content of the performance to determine its strengths and weaknesses—relative to its level and the known range of content associated with that level—and provide feedback about these strengths and weaknesses as well as targeted learning suggestions. We use the level score to help us tell a useful story about a particular performance, without claiming to measure “lengthweight”. This is accomplished by the rigorous separation of structure (level) and content.
*If we described objects in terms of their lengthweight, an object that was 10 inches long and 2 lbs could have a lengthweight of 12, but so could an object that was 2 inches long and 10 lbs.
About measurement
Posted by Theo in Lectical Assessment System, cognitive development, measurement on July 29, 2009
The story of how measurement permits scientific advance can be illustrated through any number of examples. One such example is the measurement of temperature and its effects on our understanding of the molecular structure of lead and other elemental substances.
The tale begins with an assortment of semi-mythical early scientists, who agreed in their observations that lead only melts when it is very hot—much hotter than the temperature at which ice melts, and quite a bit cooler than the temperature at which iron melts. These observations, made repeatedly, resulted in the hypothesis that lead melts at a particular temperature.
To test this theory it was necessary to develop a standard for measuring temperature. A variety of early thermometers were developed and implemented. Partly because these early temperature-measuring devices were poorly calibrated, and partly because different temperature-measuring devices employed different scales, the temperature at which lead melted seemed to vary from device to device and context to context.
Scientists divided into a number of ‘camps’. One group argued that there were multiple pathways toward melting, which explained why the melting seemed to occur at different temperatures. Another group argued that the melting of lead could not be understood apart from the context in which the melting occurs. Only when a measure of temperature had been adequately developed and widely accepted did it become possible to observe that lead consistently melts at about 327 °C.
Armed with this knowledge, scientists asked what it is about lead that causes it to melt at this particular temperature. They then developed hypotheses about the factors contributing to this phenomenon, observing that changes in altitude or air pressure seemed to result in small differences in its melting temperature. So, context did seem to play a role! In order to observe these differences more accurately, the measurement of temperature was further refined. The resulting observations provided information that ultimately contributed to an understanding of lead’s and other elements’ molecular structure.
While parts of this story are fictional, it is true that the thermometer has greatly contributed to our understanding of the properties of lead. Interestingly, the thermometer, like all other measures, emerged from what were originally qualitative observations about the effects of different amounts of heat that were quantified over time. The value of the thermometer, as we all know, extends far beyond its use as a measure of the melting temperature of lead. The thermometer is a measure of temperature in general, meaning that it can be employed to measure temperature in an almost limitless range of substances and contexts. It is this generality, in the end, that makes it possible to investigate the impact of context on the melting temperature of a substance, or to compare the relative melting temperatures of a range of elemental substances. This generality (or context-independence) is one of the primary features of a good measure.
Good measurement requires (1) the identification of a unidimensional, content- and context-independent trait (temperature, length, time); (2) a system for assessing the amount of the trait; (3) determinations of the reliability and validity of the assessments; and finally (4) the calibration of a measure. A good thermometer has all of the qualities of a good measure. It is a well-calibrated instrument that can be employed to accurately and reliably measure a general, unidimensional trait across a wide range of contexts.
It was this perspective on measurement that first inspired me to try to find a good general measure of the developmental dimension. To read more about how this way of thinking relates to the Lectical Assessment System (LAS), read About Measurement on the DTS site. Pay special attention to the list of things we can do with the LAS.
Reliability 2: How high should it be?
Posted by Theo in educational testing, standardized testing, testing in general on July 5, 2009
There is a great deal of confusion in the assessment community about the interpretation of statistical reliability. This confusion results in part from the different ways in which researchers and test developers approach the issue. Researchers learn how to design research instruments which they use to study population trends or compare groups. They evaluate the quality of their instruments with statistics. One of the statistics used is Cronbach’s Alpha, an indicator of statistical reliability that ranges from 0 to 1. Researchers are taught that Alphas above .77 or so are acceptable for their instruments, because this level of reliability ensures that their instrument is measuring real differences between people.
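For readers who have not seen it computed, Cronbach’s Alpha can be sketched in a few lines. This is the standard textbook formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The response data below are made up purely for illustration.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's Alpha for a table of scores.

    item_scores: list of rows, one row per respondent, one column per item.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Transpose rows into per-item columns.
    columns = list(zip(*item_scores))
    item_variances = [variance(col) for col in columns]
    # Variance of each respondent's total score across all items.
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up responses: 5 respondents x 4 items.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
print(round(cronbach_alpha(responses), 2))
```

When items rise and fall together across respondents, the total-score variance dominates the summed item variances and Alpha approaches 1.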
Test developers use a special branch of statistics called psychometrics to build assessments. Assessments are designed to evaluate individuals. Like researchers, test developers are concerned about reliability, but for somewhat different reasons. From a psychometric point of view, it is not enough to know that an assessment measures real differences between people. Psychometricians need to be confident that the score awarded to an individual is a good estimate of that particular individual’s true score. Because of this, most psychometricians set higher standards for reliability than those set by researchers.
The table below will help to clarify why it is important for assessments to have higher reliabilities than research instruments. It shows the relationship between statistical reliability and the number of distinct levels (strata) a test can be said to have. For example, an assessment with a reliability of .80 has 3 strata, whereas an assessment with a reliability of .94 has 5.
| Reliability | Strata |
| .70 | 2 |
| .80 | 3 |
| .90 | 4 |
| .94 | 5 |
| .96 | 7 |
| .97 | 8 |
| .98 | 9 |
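The post does not say how the strata counts in the table were derived. One relation commonly used in the Rasch measurement literature first converts reliability R into a separation index G = sqrt(R / (1 − R)) and then into strata H = (4G + 1) / 3. Whether the table was built exactly this way is an assumption; the sketch below simply shows that the formula lands within about one stratum of each tabled value.

```python
import math


def separation(reliability):
    """Separation index G: ratio of true spread to measurement error."""
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(reliability / (1 - reliability))


def strata(reliability):
    """Number of statistically distinct levels, H = (4G + 1) / 3 (unrounded)."""
    return (4 * separation(reliability) + 1) / 3


for r in (0.70, 0.80, 0.90, 0.94, 0.96, 0.97, 0.98):
    # Printed unrounded; the table reports whole numbers.
    print(f"{r:.2f}  G={separation(r):.2f}  H={strata(r):.2f}")
```

Note how quickly the curve steepens: moving from .80 to .90 buys roughly one extra stratum, while the same is true of the much smaller step from .97 to .98.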
Strata have direct implications for the confidence we can have in a specific person’s score on a given assessment, because they tell us something about the range within which a person’s true score would fall, given a particular score. Imagine that you have taken a test with a scoring range of 0 to 500 and a reliability of .94. The number of strata into which this assessment can be divided is 5, which means that each stratum equals about 100 points on the 500-point scale. If your score on this test is 350, your true score is likely to fall within the range of 300 to 400*.
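The arithmetic in that example can be sketched as follows. The function name and the choice to center one stratum width on the observed score are illustrative assumptions; as the footnote says, real bands are not the same width everywhere on the scale.

```python
def score_band(score, scale_min, scale_max, n_strata):
    """Return (low, high): one stratum width centered on the observed score.

    Simplification: assumes equal-width strata across the whole scale.
    """
    width = (scale_max - scale_min) / n_strata
    return score - width / 2, score + width / 2


# The example from the text: a 0 to 500 scale, 5 strata, observed score of 350.
print(score_band(350, 0, 500, 5))  # (300.0, 400.0)
```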
Statistical reliability is only one of the ways in which assessments should be evaluated. Test developers should also ask how well an assessment measures what it is intended to measure. And those who use an assessment should ask whether or not what it measures is relevant or important.
*This range will be wider at the top and bottom of the scoring range and a bit narrower in the middle of the range.
References
Guilford J. P. (1965). Fundamental statistics in psychology and education. 4th Edn. New York: McGraw-Hill.
Kubiszyn T., Borich G. (1993). Educational testing and measurement. New York: Harper Collins.
Wright B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.
IQ and development
Posted by Theo in cognitive development, educational testing, testing in general on April 14, 2009
IQ is a dimension of ability that has been defined using a form of statistical modeling called psychometrics. It is based entirely on psychometric analysis of results from tests consisting of many items, each of which has one correct answer.
IQ scores are arranged along a scale that is based upon the performances of hundreds of people who have taken the same test.
IQ is considered to be a relatively fixed characteristic of a person. People who score higher on an IQ test are considered to be more intelligent than people who score lower.
Cognitive development is a theoretically defined, evidence-based dimension. Developmental level is determined by asking individuals to engage in activities that expose their reasoning. Items on developmental assessments are typically open-ended and do not focus on correct answers. They focus on how people go about seeking answers.
A single developmental dimension has been shown to underlie development in a wide range of cognitive domains, making it possible to define a non-arbitrary scale along which development progresses. Individual performances can be placed within a range on this scale.
Cognitive developmental level is not viewed as a fixed trait and is known to vary within persons, depending on knowledge area and a range of contextual variables. Individuals who demonstrate higher levels of cognitive development are viewed as more cognitively developed than those demonstrating lower levels of cognitive development.
The relation between IQ and cognitive development
Children with higher IQs learn the kind of knowledge and skills represented in IQ tests earlier than children with lower IQs. There is some evidence that cognitive development is likely to be more rapid (and have a higher “endpoint”) in people who have higher IQs.
Limitations of testing
The subject matter of IQ tests is limited, and the skill sets that are tested are narrow, so we have to be careful about making generalizations about people based on test results—especially the results of single tests. The same is true for cognitive developmental assessments. Good cognitive developmental assessments are now providing scores with a level of precision similar to that of conventional assessments, but even the most precise and accurate scores apply to performance on a single assessment in a single subject area, and do not capture the full range of capabilities of a test-taker.
The inability of any single assessment (or type of assessment) to provide an accurate account of the capabilities of an individual suggests that the best (most ethical) use of assessments involves repeated measurements across a wide range of subject areas over time.
Test validity (part 1)
Posted by Theo in learning, standardized testing, testing in general on March 23, 2009
If a test (1) measures what it is intended to measure (construct validity) and (2) measures something of value (ecological validity), it is considered to be a valid test. Sounds pretty straightforward, but it’s not. That’s partly because these two categories of validity often compete with one another, and it is a challenge to find the right balance.
For example, it seems pretty obvious that math items should be about math and reading comprehension items should be about reading comprehension. So, to make sure a math test has construct validity—is about math—you ought to limit the amount of reading required to understand your test items, right?
But what if what you really want to know is how students tackle real-world math problems, which often require the ability to understand the context in which mathematical problems are encountered? After all, there are good reasons to think that a skill a student can apply in real-world contexts is superior to a skill a student can only exhibit on a test that is stripped of context. If you followed this line of reasoning and composed your test of questions that reflect how knowledge is used in the world outside of the classroom, it would have ecological validity.
However, while including context in your math test would increase its ecological validity, doing so would increase the risk of reducing its construct validity by making it less clear exactly what is being measured. This might be reflected in lowered scores for students who can do math but aren’t good readers or are unfamiliar with the kind of situations described in test questions. A result like this can look a lot like discrimination—especially when the stakes are high.
In sum, the more you strip away context, the more you risk lowering ecological validity. The more context you add, the more you risk lowering construct validity. Today, there is a strong tendency to prioritize construct validity over ecological validity, primarily because the stakes of many tests are very high, which increases our focus on anything that seems to interfere with fairness. Without intending to, test developers, policy-makers, parents, and teachers have contributed to the creation of tests with decreasing ecological validity—and there is no doubt that teachers are teaching to these tests. The implication? What students are learning in our public schools is increasingly irrelevant.
This is a cause for concern.
Reliability 1: Confidence in test scores
Posted by Theo in standardized testing, testing in general on March 21, 2009
How do you know if you can have confidence in the score you get on a test?
When you measure the height of a table, you can be pretty confident that the measurement you make is correct. Rulers are well-calibrated measures that we can use with great confidence if we use them correctly. Scores on tests are not like points on a ruler. They always have bands around them called confidence intervals. A confidence interval is the range around your score in which your “true ability” is likely to reside. Usually, a confidence interval represents a likelihood somewhere between 70% and 95%.
The overall level of confidence we can have in a test score is represented in the test’s statistical reliability. (A reliability of 1 is perfect.) As a general rule, no test that is used to evaluate individuals (as opposed to group trends) should have a statistical reliability below .85. Also, the higher the stakes, the higher the reliability should be. For example, the SAT and GRE have reliabilities in the .95 range.
There is a close relation between confidence intervals and reliability. If you place a series of 95% confidence intervals end-to-end along the scale of a really good standardized test—imagine putting pieces of string end to end along the length of a ruler—you won’t be able to fit more than 4 to 6 of them on the scale without allowing them to overlap. This means that the test can distinguish only 4 to 6 truly different levels of performance.
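The link between reliability and interval width can be sketched with the classical standard error of measurement, SEM = SD × sqrt(1 − reliability), with a 95% interval taken as the score plus or minus 1.96 SEM. The scale limits and standard deviation below are hypothetical, and the 1.96 multiplier assumes roughly normal measurement error.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: spread of observed scores around a true score."""
    return sd * math.sqrt(1 - reliability)


def confidence_interval(score, sd, reliability, z=1.96):
    """Return (low, high) around an observed score; z=1.96 gives about 95%."""
    sem = standard_error_of_measurement(sd, reliability)
    return score - z * sem, score + z * sem


def intervals_that_fit(scale_min, scale_max, sd, reliability, z=1.96):
    """How many non-overlapping intervals fit end to end along the scale."""
    width = 2 * z * standard_error_of_measurement(sd, reliability)
    return (scale_max - scale_min) / width


# Hypothetical test: scores from 200 to 800 with a standard deviation of 100.
for r in (0.85, 0.90, 0.95):
    low, high = confidence_interval(500, 100, r)
    fit = intervals_that_fit(200, 800, 100, r)
    print(f"reliability {r:.2f}: 95% CI {low:.0f} to {high:.0f}, about {fit:.1f} fit")
```

The exact count depends on how wide the scale is relative to the standard deviation, so these figures illustrate the mechanism rather than reproduce any particular test.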
So, why do scores get reported on scales that span more than 4 to 6 levels?
Test accuracy
Posted by Theo in standardized testing, testing in general on March 21, 2009
How do you know if the score you get on a test is accurate?
This depends on what you mean by accurate. If you mean, “Can I be sure that the score I receive on a test is an accurate representation of my performance?” all you need to know is if the test was scored accurately.
However, if you mean, “Can I be sure that the score I receive on a test is an accurate representation of my true abilities, competence, attitudes, dispositions, or opinions?” you can’t. It is impossible for a single test, no matter how good it is, to guarantee an accurate assessment of any of these things. In fact, it takes multiple assessments and multiple kinds of assessments to build up anything like an accurate picture.
I think what test developers do is a black box for most of us. We tend to assume that there is some kind of special insight test developers have by virtue of their psychometric tools (or knowledge about learning) that allows them to get inside of our minds and make accurate judgments about what’s going on in there. But there isn’t. All a single test can measure is performance on the items on that test. The rest is inference.
Realizing this, you may wonder why single scores on single tests are being used to make high stakes decisions about anything.
Making standardized tests
Posted by Theo in standardized testing on March 20, 2009
There are three main players in the creation of most standardized tests. They are the (1) discipline experts, (2) item developers, and (3) psychometricians. The discipline experts are usually PhDs who specialize in particular areas, like science, math, writing, or history. They know a lot about their content areas and have done research on teaching and learning in these areas. They also may be teachers of teachers.
A group of discipline experts work together to decide what material should be covered in lessons and on tests. Discipline experts set standards through organizations like the National Research Council, and may or may not be affiliated with test developers.
The item developers create test questions. They usually have a bachelor’s or master’s degree in a particular subject area. Many have not taught, and few are experts in learning and development. Item developers design test questions that cover the content of the standards. Almost all of the items designed by item developers are multiple choice, which means that they have right and wrong answers, and thus, must focus on “factual” knowledge.
The third players are psychometricians. They put together groups of items and examine how well these work together to measure students’ knowledge of the subject at hand. Generally, psychometricians know relatively little about learning and development and do not work closely with item developers or discipline experts.
Although discipline experts may include skills for thinking and learning in their standards, these skills are not measured on standardized tests, because they cannot be evaluated with multiple choice items. And although discipline experts may focus on student understanding in their standards, research has shown that up to 50% of the students who get a multiple choice item correct cannot demonstrate understanding by providing an adequate explanation of their answer.
Cognitive psychologists know about the problems that stem from how standardized tests are made. Many of them are not fans of standardized tests, partly because of the limitations of multiple choice items, partly because the tests are inadequately grounded in evidence about how students actually learn concepts and skills, and partly because the tests push teachers to emphasize breadth over depth and memorization of facts over skills for thinking and learning.
In future posts, after explaining a bit more about how tests work, I will examine an alternative testing model based on research into how students actually learn concepts and skills over time.