Too Teeter and Too Totter: March 2011

There's been a lot of news lately about cases of confirmed or suspected cheating by administrators and teachers on high-stakes tests. This is, of course, not surprising. When a single measure such as test scores is used to make decisions about school funding, jobs, and whether or not to keep a school open, you can be sure there'll be outright cheating. But there's a much stickier question that arises in this kind of environment, one that's less cut-and-dried than teachers telling kids to change their answers on a test: where do we draw the line between test preparation and cheating? When does test prep render our test results less useful?

I was enthralled, fascinated, and suffering from a bit of an intellectual crush a few months ago when I read Measuring Up: What Educational Testing Really Tells Us, by Dan Koretz. (I used to call him Daniel, but now he's Dan to me.) What amazed me most of all was that neither I, nor seemingly most of the education practitioners with whom I work, really understand very much about how educational testing works. These tests are the single most powerful force driving our daily work -- and most of us just accept the conventional wisdom about them without blinking.

Conventional wisdom: tests don't measure everything that's important, but it's too hard to measure everything that's important. (This is true.) Since there's no way to measure everything important, we'll have to just focus on the tests, for lack of anything better. (I've definitely moved more in this direction over the years, but that's a mistake.) The tests do give you important information about how your school is doing -- and if your kids are doing well, there must be a lot of good teaching going on. (Definitely not true.)

Probably the most important thing I learned from Koretz's book was the existence of Campbell's Law. Campbell's Law, it turns out, is a well-known rule of social science. It states:

The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

Examples in Koretz's book, and in the blog post linked to above, include the unintended consequences of assessing the on-time record of airline flights and the death rate of heart surgeries. When data on the patient death rates of individual heart surgeons began to be collected and disseminated, in an effort to assess their surgical skill, surgeons became more reluctant to operate on the most dire cases -- patients who were, of course, more likely to die during surgery, thereby sullying the surgeons' statistics. The statistics improved, but health care for heart patients did not.

When data on the on-time rate of individual airlines' flights were publicized, all airlines' data began to improve, even though individual travelers didn't notice an improvement in their own flight experiences. Why? Airlines began to pad their estimates of travel times, building in time for regular delays. Even if delayed, flights appeared to arrive "on time," and many flights arrived early. (This latter example isn't egregious -- it kind of seems like good planning. The point is, though, that the impression that the data were improving is misleading.)

Campbell's Law has been coming up lately in the conversation about high-stakes testing (which makes me wish I'd written this blog post a few months ago, as I intended to -- now others have beaten me to the punch). It's essential that schools get good test scores, since their performance on tests is tied to pretty much everything -- so, scores will go up. But increased scores don't mean that students are learning more. They just mean that students are getting better at that particular test.

Koretz did a study (which was very hard to do, because no one wanted it done in their district) to measure this effect. In a large urban district, the average third-grade score on the standardized math test in 1986 was a grade-level equivalent of 4.3. Then, the district switched to a new test. The next year, test scores plummeted to an average of 3.7. For the next three years, though, the scores rose until third graders were again scoring at a grade-level equivalent of 4.3.

In the fourth year, Koretz also administered the old test -- the one that, four years earlier, third graders had been doing well on. And what do you think happened? While on the new test, the one their teachers had been teaching to for four years, they scored an average of 4.3, on the old test, the average score was 3.7 -- exactly the score their counterparts had originally earned on the new test when it was first used.

In other words, students weren't getting better at math in those four years, even though their test scores were improving. They were getting better at that one particular math test.

So why do we use educational testing? Most of us would agree, I think, that we use tests so we can see what students know and can do: what they are learning. When high-stakes test scores go up, they tell us what students are learning in just one realm -- they tell us that students are learning how to take specific tests. Teachers are adjusting the content they teach to match those specific tests, and teachers are passing on test-taking tricks that help their students score better.

If we want tests that tell us what a specific school, or grade or district or country, knows about math, though, high-stakes tests don't tell us much. What amazes me is how little people are talking about this. Campbell's Law is well-known -- but I certainly didn't know about it. No one in my school or my district said, "You know, because these tests have so much tied to them, they don't tell us much about what kids know." Instead, everyone told us we had to do a better job of preparing our students to pass the tests. As if that's why we became teachers.

A month or so ago, I attended (and walked out of) a staff meeting that was purportedly about how to teach our students to answer Open Response questions on the state high-stakes test. I thought it might be useful; I had noticed that my students weren't so good at reading a question, understanding what it was asking, and distilling what they knew about the answer into a few sentences. I thought I might get a few ideas about how to help kids write more clearly about what they knew.

It turned out to be a presentation about tricks that help kids get higher scores on Open Response questions, not how to write about what you know in response to a question. The speaker told us important details about how the questions are scored, including the fact that the organization of the response isn't scored at all. All the scorers look for is the right content, so students don't need to worry about writing topic sentences or spelling. "This is just about scoring better on Open Response," she said, "not about being better writers."

Please bear in mind: this woman was brought to our school by our school district. This workshop was a version of one she had attended, put on by the Department of Elementary and Secondary Education -- yes, the same people who bring us our state test. So, they make the test, they create the benchmarks, they score the test -- and, they teach us how to help our kids do better on the test. Do they want to measure what our students know about reading and writing? Or do they just want better test scores?

I recently took an educational test -- the GRE. And I did a lot of test prep for it. I reviewed a lot of math concepts, things I once knew but have since forgotten. Did I learn them deeply, in such a way that I still know them now, three months later? No. I memorized a few formulas and a lot of tricks, including the fact that, for example, the GRE mostly uses 3, 4, 5 triangles and 5, 12, 13 triangles. (Doesn't that sound vaguely familiar from geometry?)

Was I cheating? Not technically. Did I do well on the test? Yes. Did my performance indicate what I really know about math and can use on a daily basis? Absolutely not.

I'm not advocating that we keep using educational tests, with their somewhat arcane language and strange scoring rules, without preparing students for them at all. The risk we run then is that students' test scores will underestimate what they know and can do (which surely happens now as well). But it does seem like a perverse system when we put so much of our teaching time and energy into helping kids beat the tests. All this talk about extending the school day and the school year -- while students in grades 3 and up can spend up to 30 school days a year taking tests (not counting all the days they spend prepping for them). Talk about lost learning time.

Ten years ago, I started out as a teacher who thought "standards" was a dirty word and "testing" was worse. Over the years, my views became somewhat more mainstream, as standards became such a fact of teaching that we couldn't imagine schools without them and testing more and more determined our fate. Now I'm on my way out of the classroom, at least for a year, and I'm coming back to where I started -- disillusioned by arbitrary standards and test results that tell us more about a student's socioeconomic background and test-taking smarts than they do about what she really knows.

Thursday, March 31, 2011

Cheating and other hazards of high-stakes testing