What every buyer should know about forms of assessment

In this post, I'll be describing and comparing three basic forms of assessment—surveys, tests of factual and procedural knowledge, and performative tests.

Surveys—measures of perception, preference, or opinion

What is a survey? A survey (a.k.a. inventory) is any assessment that asks the test-taker to choose from a set of options, such as "strongly agree" or "strongly disagree", based on opinion, preference, or perception. Organizations use surveys in several ways. For example, opinion surveys can help maintain employee satisfaction by providing a "safe" way to express dissatisfaction before workplace problems have a chance to escalate.

Just about everyone who's worked for a large organization has completed a personality inventory as part of a team-building exercise. The results stimulate lots of water cooler discussions about which "type" or "color" employees are, but their impact on employee performance is unclear. (Fair warning: I'm notorious for my discomfort with typologies!) Some personality inventories are even used in high-stakes hiring and promotion decisions, a practice that continues despite evidence that they are very poor predictors of employee success [1].

Although most survey developers don't pretend their assessments measure competence, many do. One item of this kind, for example, appeared in a survey with the words "management skills" in its title.

Claims that surveys measure competence are most common when "malleable traits"—traits that are subject to change, learning, or growth—are targeted. One example of a malleable trait is "EQ" or "emotional intelligence". EQ is viewed as a skill that can be developed, and there are several surveys that purport to measure its development. What they actually measure is attitude.

Another example of surveys masquerading as assessments of skill is in the measurement of "transformational learning". Transformational learning is defined as a learning experience that fundamentally changes the way a person understands something, yet the only way it appears to be measured is with surveys. Transformational learning surveys measure people's perceptions of their learning experience, not how much they are actually changed by it.

The only survey-type assessments that can be said to measure something like skill are those—such as 360s—that ask respondents about their perceptions of another person's behavior. Although 360s inadvertently measure other things, like how much a person is liked or whether a respondent agrees with that person, they may also document evidence of behavior change. If behavior change is what you're interested in, a 360 may be appropriate in some cases. Keep in mind, though, that a 360 that registers change in a target's behavior is also likely to register changes in respondents' attitudes that have nothing to do with that behavior.

360-type assessments may, to some extent, serve as tests of competence, because behavior change may be an indication that someone has learned new skills. When an assessment measures something that might be an indicator of something else, it is said to measure a proxy. A good 360 may measure a proxy (perceptions of behavior) for a skill (competence).

There are literally hundreds of research articles that document the limitations of surveys, but I'll mention only one more of them here: All of the survey types I've discussed are vulnerable to "gaming"—smart people can easily figure out what the most desirable answers are.

Surveys are extremely popular today because, relative to assessments of skill, they are inexpensive to develop and cost almost nothing to administer. Because they are so inexpensive, Lectica gives away several high-quality surveys for free. Yet organizations spend millions of dollars every year on surveys, many of which are falsely marketed as assessments of skill or competence.

Tests of factual and procedural knowledge

A test of competence is any test that asks the test taker to demonstrate a skill. Tests of factual and procedural knowledge can legitimately be thought of as tests of competence.

The classic multiple choice test examines factual knowledge, procedural knowledge, and basic comprehension. If you want to know whether someone knows the rules, which formulas to apply, the steps in a process, or the vocabulary of a field, a multiple choice test may meet your needs. Developers of multiple choice tests often claim that their assessments measure understanding, reasoning, or critical thinking. This is because some multiple choice tests measure skills that are assumed to be proxies for understanding, reasoning, and critical thinking. They are not direct tests of these skills.

Multiple choice tests are widely used because there is a large industry devoted to making them, but they are increasingly unpopular because of their (mis)use as high-stakes assessments. They are often perceived as threatening and unfair because they are typically used to rank or select people rather than to help the individual learner. Moreover, their relevance is often questioned because they don't directly measure what we really care about—the ability to apply knowledge and skills in real-life contexts.

Performative tests

Tests that ask people to demonstrate their skills directly—in (1) the real world, (2) real-world simulations, or (3) real-world scenarios—are called performative tests. These tests usually do not have "right" answers. Instead, they employ objective criteria to evaluate performances for the level of skill demonstrated, and they often play a formative role by providing feedback designed to improve performance or understanding. This is the kind of assessment you want if what you care about is deep understanding, reasoning skills, or performance in real-world contexts.

Performative tests are the most difficult tests to make, but they are the gold standard if what you want to know is the level of competence a person is likely to demonstrate in real-world conditions—and if you're interested in supporting development. Standardized performative tests are not yet widely used, because the methods and technology required to develop them are relatively new, and there is not yet a large industry devoted to making them. But they are increasingly popular because they support learning.

Unfortunately, performative tests may initially be perceived as threatening because people's attitudes toward tests of knowledge and skill have been shaped by their exposure to high stakes multiple choice tests. The idea of testing for learning is taking hold, but changing the way people think about something as ubiquitous as testing is an ongoing challenge.

Lectical Assessments

Lectical Assessments are performative tests—tests for learning. They are designed to support robust learning—the kind of learning that optimizes the growth of essential real-world skills. We're the leader of the pack when it comes to the sophistication of our methods and technology, our evidence base, and the sheer number of assessments we've developed.

[1] Morgeson, F. P., et al. (2007). Are we getting fooled again? Coming to terms with limitations in the use of personality tests for personnel selection. Personnel Psychology, 60, 1029–1033.


The limitations of testing

It is important for those of us who use assessments to ensure that they (1) measure what we say they measure, (2) measure it reliably enough to justify claimed distinctions between and within persons, and (3) are used responsibly. It is relatively easy for testing experts to create assessments that are reliable enough (2) for individual assessment, and although it is harder to show that a test measures the construct of interest (1), there are reasonable methods for demonstrating that an assessment meets this standard. Hardest of all is ensuring that assessments are used responsibly (3).

Few consumers of tests are aware of their inherent limitations. Even the best tests, those that are highly reliable and measure what they are supposed to measure, provide only a limited amount of information. This is true of all measures. The more we home in on a measurable dimension—in other words, the greater our precision becomes—the narrower the construct becomes. Time, weight, height, and distance are all extremely narrow constructs. This means that they provide a very specific piece of information extremely well. When we use a ruler, we can have great confidence in the measurement we make, down to very small lengths (depending on the ruler, of course). No one doubts the great advantages of this kind of precision. But we can’t learn anything else about the measured object. Its length usually cannot tell us what the object is, how it is shaped, its color, its use, its weight, how it feels, how attractive it is, or how useful it is. We only know how long it is. To provide an accurate account of the thing that was measured, we need to know many more things about it, and we need to construct a narrative that brings these things together in a meaningful way.

A really good psychological measure is similar. The LAS (Lectical Assessment System), for example, is designed to go to the heart of development, stripping away everything that does not contribute to the pure developmental “height” of a given performance. Without knowledge of many other things—such as the ways of thinking that are generally associated with this “height” in a particular domain, the specific ideas that are associated with this particular performance, information from other performances on other measures, qualitative observations, and good clinical judgment—we cannot construct a terribly useful narrative.

And this brings me to my final point: A formal measure, no matter how great it is, should always be employed by a knowledgeable mentor, clinician, teacher, consultant, or coach as a single item of information about a given client that may or may not provide useful insights into relevant needs or capabilities. Consider this relatively simple example: a given 2-year-old may be tall for his age, but if he is somewhat under weight for his age, the latter measure may seem more important. However, if he has a broken arm, neither measure may loom large—at least until the bone is set. Once the arm is safely in a cast, all three pieces of information—weight, height, and broken arm—may contribute to a clinical diagnosis that would have been difficult to make without any one of them.

It is my hope that the educational community will choose to adopt high standards for measurement, then put measurement in its place—alongside good clinical judgment, reflective life experience, qualitative observations, and honest feedback from trusted others.


What is a holistic assessment?

Thirty years ago, when I was a hippy midwife, the idea of holism began to slip into the counter-culture. A few years later, this much misunderstood notion was all the rage on college campuses. By the time I was in graduate school in the nineties, there was an impassable division between the trendy postmodern holists and the rigidly old-fashioned modernists. You may detect a slight mocking tone, and rightly so. People with good ideas on both sides made themselves look pretty silly by refusing, for example, to use any of the tools associated with the other side. One of the more tragic outcomes of this silliness was the emergence of the holistic assessment.

Simply put, the holistic assessment is a multidimensional assessment that is designed to take a more nuanced, textured, or rich approach to assessment. Great idea. Love it.

It’s the next part that’s silly. Having collected rich information on multiple dimensions, the test designers sum up a person’s performance with a single number. Why is this silly? Because the so-called holistic score becomes pretty much meaningless. Two people with the same score can have very little in common. For example, let’s imagine that a holistic assessment examines emotional maturity, perspective taking, and leadership thinking. Two people receive a score of 10, perhaps accompanied by boilerplate descriptions of what emotional maturity, perspective taking, and leadership thinking look like at level 10. However, person one was actually weak in perspective taking and strongest in leadership, and person two was weak in emotional maturity and strongest in perspective taking. The score of 10, it turns out, means something quite different for these two people. I would argue that it is relatively meaningless because there is no way to know, based on the single “holistic” score, how best to support the development of these distinct individuals.
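To make the collision concrete, here is a minimal sketch in Python (the dimension scores are invented for illustration) of how summing across dimensions collapses two very different profiles into the same "holistic" score:

```python
# Two hypothetical test-takers, each scored on three dimensions
person_one = {"emotional_maturity": 4, "perspective_taking": 2, "leadership": 4}
person_two = {"emotional_maturity": 2, "perspective_taking": 4, "leadership": 4}

# The "holistic" score sums across dimensions...
print(sum(person_one.values()))  # 10
print(sum(person_two.values()))  # 10

# ...so two profiles that call for quite different learning support
# become indistinguishable once the dimensions are lumped together.
```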

Holism has its roots in system dynamics, where measurements are used to build rich models of systems. All of the measurements are unidimensional. They are never lumped together into “holistic” measures. That would be equivalent to talking about the temperaturelength of a day or the lengthweight of an object*. It’s essential to measure time, weight, and length with appropriate metrics and then to describe their interrelationships and the outcomes of these interrelationships. The language used to describe these is the language of probability, which is sensitive to differences in the measurement of different properties.

In psychological assessment, dimensionality is a challenging issue. What constitutes a single dimension is a matter for debate. For DTS, the primary consideration is how useful an assessment will be in helping people learn and grow. So, we tend to construct individual assessments, each of which represents a fairly tightly defined content space, and we use only one metric to determine the level of a performance. The meaning of a given score is both universal (it is an order of hierarchical complexity and phase on the skill scale) and contextual (it is assigned to a performance in a particular domain and context, and is associated with particular content). We independently analyze the content of the performance to determine its strengths and weaknesses—relative to its level and the known range of content associated with that level—and provide feedback about these strengths and weaknesses as well as targeted learning suggestions. We use the level score to help us tell a useful story about a particular performance, without claiming to measure “lengthweight”. This is accomplished by the rigorous separation of structure (level) and content.

*If we described objects in terms of their lengthweight, an object that was 10 inches long and 2 lbs could have a lengthweight of 12, but so could an object that was 2 inches long and 10 lbs.


Teacher pay and standardized test results

At the end of October, the Century Foundation released a paper entitled, Eight reasons not to tie teacher pay to standardized test results. I agree with their conclusions, and would add that even if all standardized tests were extremely reliable and measured exactly what they intended to measure, this would be a bad idea. This is because success in the adult world requires a multiplicity of skills and forms of knowledge, and tests focus on only some of these, one at a time. Until we can construct multifaceted longitudinal stories about the progress of individual students that are tied to a non-arbitrary standardized metric, we should not even consider linking student evaluations to teacher pay.


Predicting trends, testing people

Mark Forman, in his response to the post entitled, IQ and development, wrote about the difference between predicting trends and testing individuals. I agree that people, including many academics, do not understand the difference between using assessments to predict trends and using assessments to make judgments about individuals. There are two main issues: First, as Mark argues, questions of validity differ, depending upon whether we are looking at individuals or population trends. If we are looking at trends, determining predictive validity is a simple matter of determining if an assessment helps an institution make more successful decisions than it was able to make without the assessment. However, if a test is intended to be useful to individuals (aid in their learning, help them determine what to learn next, help them find the best place to learn, help them decide what profession to pursue, etc.), predictive validity cannot be determined by examining trends. In this case, the predictive validity of an assessment should be evaluated in terms of how well it predicts what individual test-takers can most benefit from learning next, where they can learn it, or what kind of employment they should seek—as individuals.

The second issue concerns reliability. Especially in the adult assessment field, researchers often do not understand that the levels of statistical reliability considered acceptable for studies of population trends are far from adequate for making judgments about individuals. Many of the adult assessments that are on the market today have been developed by researchers who do not understand the reliability criteria for assessments used to test individuals*. As a consequence, the reliability of these assessments is often so low that we cannot be confident that a score on a given assessment is truly different from any other score on that assessment.

*Unfortunately, there is no magic reliability number. But here are some general guidelines. The absolute minimum statistical reliability for an assessment that claims to distinguish two or three levels of performance is an alpha of .85. To claim up to 6 levels, you need an alpha of .95. You will also want to think about the meaning of these distinctions between levels in terms of confidence intervals. A confidence interval is the range in which an individual’s true score is most likely to fall.  For example, in the case of Lectical™ assessments, the statistical reliabilities we have calculated over the last 10 years indicate that the confidence interval around Lectical scores is generally around 1/4 of a level (a phase).
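If you'd like to see how reliability translates into a confidence interval, here is a minimal sketch in Python using the classical-test-theory standard error of measurement (the score scale, SD, and reliability values are hypothetical, not Lectica's actual calculations):

```python
import math

def confidence_interval(score, sd, reliability, z=1.96):
    """95% confidence interval around an observed score, using the
    classical-test-theory standard error of measurement:
    SEM = SD * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1 - reliability)
    return (score - z * sem, score + z * sem)

# A hypothetical test scored on a scale with SD = 10:
print(confidence_interval(100, sd=10, reliability=0.85))  # about (92.4, 107.6)
print(confidence_interval(100, sd=10, reliability=0.95))  # about (95.6, 104.4)
```

Notice how raising the alpha from .85 to .95 narrows the interval considerably, which is why claiming finer distinctions between levels demands higher reliability.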

Advice: If statistical reliability is not reported (preferably in a peer reviewed article), don’t use the test.


Testing the limits of testing

The NTS is an interactive online survey that asks about (1) the legitimate purposes of testing and (2) how well today’s tests serve these purposes. In addition to completing a set of survey questions, respondents are offered an opportunity to write about their personal experiences with testing and share alternative testing resources. When respondents have completed the survey, they can view their results and compare them to national averages. Anyone who visits the site can read respondents’ stories, explore the resources, and track national results. Please participate in the NTS, and use your email lists and social networks to spread the word! Feel free to circulate the NTS poster or the poster announcing the NTS launch event. Contact Zachary Stein if you have questions or would like to become involved.

NTS launch event: Testing the limits of testing

Thursday, May 28th, 2009, 4:00 – 5:30 pm

Zachary Stein, Marc Schwartz, and Theo L. Dawson

The launch event will occur just prior to the opening of the second annual conference of the International Mind, Brain, and Education Society (IMBES) at the Sheraton Society Hill Hotel in Philadelphia, Pennsylvania. At this event, speakers will present preliminary data from the NTS, examine the limits of current test development methods, and explore new approaches to assessment, incorporating the perspectives of stakeholder groups who have participated in the survey so far.

More information is available on the NTS site.

Admission to the launch is FREE and open to the public, but space is limited. To attend, you must obtain a ticket from the NTS web site.

The conference will also feature a workshop on testing:

Educational testing for the 21st century: Challenges, models, and solutions

10:45 – 3:45, Saturday, May 30

Kurt Fischer, Marc Schwartz, Theo Dawson, Zachary Stein

The most basic form of educational testing takes the form of a “conversation” between an individual student and a teacher in which the student reveals what he or she is most likely to benefit from learning next. This kind of conversation increasingly takes a back seat to standardized forms of assessment that are designed to rank students for purposes that are dissociated from learning itself. Testing has lost its roots. The statistically generated rankings of standardized tests tell us very little about the specific learning needs of individual students. And it is becoming increasingly apparent that the kind of knowledge required to succeed on a typical standardized test bears little resemblance to the kind of knowledge required for adult life. The challenge we now face is creating the kind of mass-customization that revives the educative role of assessments in the local dialogue between teachers, students, and the curriculum, while maintaining the advantages of standardization. Simply stated: we need tests that help teachers meet the learning needs of individual students—tests teachers ought to teach to.

In this workshop, we explore perspectives on these issues from the classroom, cognitive developmental science, psychometrics, and philosophy, and offer a concrete vision for the future of assessment. The workshop is intended for educators, administrators, researchers, and policy makers. It is FREE to those who register for the entire IMBES conference. If you are interested in attending only the workshop, the fee is $80 before April 28th, and $95 after April 28th.

You can register for the conference or the workshop at the IMBES site.

Construct and ecological validity

Test developers face a tension between construct and ecological validity. If a test is (1) measuring what it intends to measure (construct validity) and (2) what it is measuring is of value (ecological validity), it is considered to be a valid test. Sounds pretty straightforward, but it's not. That's partly because construct and ecological validity often compete with one another—and it is a challenge to find the right balance.

For example, it seems pretty obvious that math items should be about math and reading comprehension items should be about reading comprehension. So, to make sure a math test has construct validity—is about math—you ought to limit the amount of reading required to understand your test items, right?

But what if what you really want to know is how students tackle real-world math problems, which often require the ability to understand the context in which they are encountered? After all, there are good reasons to think that a skill a student can apply in real-world contexts is superior to a skill a student can only exhibit on a test that is stripped of context. If you followed this line of reasoning and composed your test of questions that reflect how knowledge is used in the world outside the classroom, it would have ecological validity.

Here lies the tension between construct and ecological validity: While including context in your math test would increase its ecological validity, doing so would also increase the risk of reducing its construct validity by making it less clear exactly what is being measured. This might show up as lowered scores for students who can do math but aren't good readers, or who are unfamiliar with the kinds of situations described in test questions. A result like this can look a lot like discrimination—especially when the stakes are high.

In sum, the more you strip away context, the more you risk lowering ecological validity. The more context you add, the more you risk lowering construct validity. Today, there is a strong tendency to prioritize construct validity over ecological validity, primarily because the stakes of many tests are very high, which increases our focus on anything that seems to interfere with fairness. Without intending to, test developers, policy-makers, parents, and teachers have contributed to the creation of tests with decreasing ecological validity—and there is no doubt that teachers are teaching to these tests. The implication? What students are learning in our public schools is increasingly irrelevant to competence in the real world. 

This is a cause for concern.


Statistics for all: What the heck is confidence?

Confidence in testing

I doubt there is a person in the Western world over the age of 4 who hasn’t taken a psychological or educational test. Yet very few of us know one of the most important facts about these tests—their scores are always imprecise.

When you measure the height of a child, you can be pretty confident that the measurement you make is correct within a fraction of an inch on either side. And if you check the time on your mobile phone, you can be pretty certain that it is accurate within a fraction of a minute on either side. Rulers and clocks are well-calibrated measures that we can use with great confidence if we use them correctly. The same is true of measures of temperature, speed, frequency, and weight.

But even measurements made with these metrics are more or less precise. They’re correct within a range. These ranges are called confidence intervals. The confidence interval around the measurement of a child’s height would be expressed as something like “82 centimeters plus or minus 1/2 of a centimeter.” Statisticians would say that the child’s true height is likely to be somewhere in this range.

Scores on educational and psychological tests have confidence intervals too. But there is a difference between these confidence intervals and those for physical measurements. The confidence intervals around scores on psychological and educational tests are larger than the confidence intervals around measurements in the physical world. How much larger? Let’s look at an example.

The psychological and educational tests with the smallest confidence intervals are those made by high-stakes test developers like ETS. For their high-stakes tests — the ones used to make decisions like who gets to go to which college — they set the highest standard. This standard, if it were applied to measuring height, would allow us to say something along the lines of, “We’re confident that this child is 82 centimeters tall, give or take 8 centimeters.”

Now, you may argue that 8 centimeters isn’t all that much, but if you’re buying a car seat or deciding who gets to ride a roller coaster, it could be the difference between life and death. Measurement precision matters.
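To put rough numbers on this, here is a minimal sketch in Python (the reliability values are illustrative, not taken from any particular test) showing how wide a 95% confidence interval is, relative to the spread of scores, at several reliability levels:

```python
import math

def ci_width_in_sd_units(reliability, z=1.96):
    """Width of a 95% confidence interval in standard-deviation units,
    using the classical-test-theory standard error of measurement:
    SEM = SD * sqrt(1 - reliability)."""
    return 2 * z * math.sqrt(1 - reliability)

for r in (0.80, 0.90, 0.95):
    print(f"reliability {r:.2f}: CI spans {ci_width_in_sd_units(r):.2f} SDs")
# reliability 0.80: CI spans 1.75 SDs
# reliability 0.90: CI spans 1.24 SDs
# reliability 0.95: CI spans 0.88 SDs
```

Even at a reliability of .95, the interval spans nearly a full standard deviation of the score scale, which is why the height analogy above is not as far-fetched as it may sound.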

The more imprecise our measurements are — the bigger the confidence intervals around them — the more careful we need to be about the kinds of decisions we make with them. When it comes to educational and psychological assessment, I think we’re far too careless. Too many people who buy and use assessments don’t know enough about statistics to make well-informed assessment decisions.

Fortunately, I believe we can remedy this! And it seems to me that the best place to begin is with confidence, so, in the next article in this series I’m going to share a super-easy way to figure out how much confidence you can have in any test’s scores.



Test accuracy

How do you know if the score you get on a test is accurate?

This depends on what you mean by accurate. If you mean, “Can I be sure that the score I receive on a test is an accurate representation of my performance?” all you need to know is if the test was scored accurately.

However, if you mean, “Can I be sure that the score I receive on a test is an accurate representation of my true abilities, competence, attitudes, dispositions, or opinions?” you can’t. It is impossible for a single test, no matter how good it is, to guarantee an accurate assessment of any of these things. In fact, it takes multiple assessments and multiple kinds of assessments to build up anything like an accurate picture.

I think what test developers do is a black box for most of us. We tend to assume that there is some kind of special insight test developers have by virtue of their psychometric tools (or knowledge about learning) that allows them to get inside of our minds and make accurate judgments about what’s going on in there. But there isn’t. All a single test can measure is performance on the items on that test. The rest is inference.

Realizing this, you may wonder why single scores on single tests are being used to make high stakes decisions about anything.
