Should the Plain Dealer publish value-added scores?

Saturday, July 13, 2013

On June 22, the Cleveland Plain Dealer, the largest daily newspaper in Ohio, published an editorial describing the paper’s own release of “value-added measures” for the state’s teachers as “an illuminating series.” Many states, including Maryland and Illinois, are increasingly using value-added model (VAM) scores to evaluate public school teachers, which makes this a timely and important issue.

The series may have been illuminating, as the Plain Dealer’s editorial board describes it, but the real question for me was, What exactly does it illuminate? What can we learn, and count on, from the release of the VAM score for every public school teacher in the state?

The answer, say hundreds of experts, is nothing.

Which naturally raises the question: Why do newspapers (like the esteemed Plain Dealer; the Los Angeles Times, which published VAM scores for California public school teachers in 2010 and 2011; and the New York Times, which did something similar for New York in 2012) consider this information newsworthy and publish the scores in the first place?

What are value-added ratings?

The term “value-added” applies to teachers as follows: The assumption is that each teacher a child has will teach that kid something. Therefore, if you subtract what a kid knows at the beginning of the school year from what he or she knows at the end of the school year, that difference is the measure of how much value that teacher added.

The scores are designed to remove extraneous variables from the measurement, such as the fact that kids enter a grade level at widely varying levels. Not every third grader in Mrs Becker’s class, for example, will be reading at the same level at the start of the year. Let’s use an arbitrary scale from 0 to 100+ to illustrate the point, and let’s say Mrs Becker has 10 students in her class, who stay with her all year. Their hypothetical reading scores at the beginning and end of the year are as follows:

Ashley Anderson — 80 — 83
Bill Bialins — 60 — 70
Cory Camilia — 60 — 90
Daniel Danielson — 70 — 82
Edwina Edgerton — 30 — 60
Frank Fitzsimmons — 115 — 112
Gabrielle Gustafson — 75 — 65
Jonathon Juman — 70 — 95
Kelly Keppernick — 40 — 110
Lindsay Lohan — 80 — 100

For the purposes of this example, we’re going to assume that a score of 100 means that the student is reading “at grade level” for third grade, and every 10 points represents “one grade level” of reading achievement. Thus, Kelly finished the year at 110, reading one grade level above third grade, and Mrs Becker, all by herself, gets credit for adding value to Kelly’s education, at least in terms of her reading performance.

In theory, if we take the average difference between the ending scores and beginning scores for Mrs Becker’s students, we would obtain the teacher’s “value-added score.” Let’s do the math first.

The average increase for Mrs Becker’s students was 18.7, or about 1.87 “grade levels” in reading. The system expects each teacher who has students for a year to add about one year’s value of education to that child’s experience, and Mrs Becker exceeded that goal on average. Let’s take a closer look, though.

The data show that Kelly improved by seven grade levels (70 points) over the course of the year. It’s likely Mrs Becker saw an opportunity for big gains from a student who came into the class reading six years below grade level, although we can assume Mrs Becker’s motives for helping Kelly were pure. The scores say nothing about teachers’ motivation.

If Kelly is excluded from the list, though, the average gain for Mrs Becker’s remaining nine students is 13 points, or 1.3 years, which is still good, but not quite as impressive.
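The arithmetic behind those two averages can be checked with a short script. This is only a sketch of the simple averaging used in this example, not of any real value-added model; the names `scores` and `average_gain` are made up for illustration, and the numbers are the hypothetical ones listed above.

```python
# Hypothetical (start, end) reading scores for Mrs Becker's ten students,
# copied from the list above. On this made-up scale, 10 points = one grade level.
scores = {
    "Ashley": (80, 83),
    "Bill": (60, 70),
    "Cory": (60, 90),
    "Daniel": (70, 82),
    "Edwina": (30, 60),
    "Frank": (115, 112),
    "Gabrielle": (75, 65),
    "Jonathon": (70, 95),
    "Kelly": (40, 110),
    "Lindsay": (80, 100),
}

POINTS_PER_GRADE_LEVEL = 10


def average_gain(table):
    """Mean of (end - start) across the students in `table`."""
    gains = [end - start for start, end in table.values()]
    return sum(gains) / len(gains)


with_kelly = average_gain(scores)
without_kelly = average_gain(
    {name: pair for name, pair in scores.items() if name != "Kelly"}
)

print(f"All ten students: {with_kelly:.1f} points "
      f"({with_kelly / POINTS_PER_GRADE_LEVEL:.2f} grade levels)")
print(f"Excluding Kelly:  {without_kelly:.1f} points "
      f"({without_kelly / POINTS_PER_GRADE_LEVEL:.2f} grade levels)")
```

A single student accounts for the difference between 18.7 and 13.0 points, which is the sensitivity to outliers that the example is meant to show.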

1st problem: the assessment instruments

It’s not quite as simple as my example above when it comes to determining each student’s score. Ultimately, the score comes from standardized tests, which are made up of test questions or items that purportedly measure what the test developers say they do. This is usually the case, but some items miss the mark. For example, one item on the Maryland School Assessment in science in 2012 was the following:

Students made lemonade using the following recipe:

100 grams of lemon juice

100 grams of sugar

1,000 grams of water

The students combined the lemon juice, sugar, and water in a container. They stirred the lemonade until all the sugar dissolved. They poured the lemonade into a plastic tray and put the tray in a freezer. The next day, the students removed the tray from the freezer and observed that the lemonade was a solid.

Which statement best explains why the lemonade became a solid?

A. The lemonade was cooled to 100°C.
B. The lemonade was heated to 100°C.
C. The lemonade was cooled below 0°C.
D. The lemonade was heated above 0°C.

You and I may know the answer, but this question was put on a standardized test (and subsequently released into the public domain) because Maryland was interested in determining how well students had mastered the following learning objective in science:

In chemistry, on states of matter … Provide evidence from investigations to identify the processes that can be used to change materials from one state of matter to another. Specifically, observe and describe the changes heating and cooling cause to the different states in which water exists. Heating causes: ice (solid) to melt forming liquid water; liquid water to evaporate forming water vapor (gas). Cooling causes: liquid water to freeze forming ice (solid); water vapor (gas) to form liquid water.

The problem is that a correct answer to the question requires a student to know that water freezes (and ice melts) at 0°C, not 100°C. The learning standard supposedly being assessed, however, says only that students at this grade level are required to know that the act of cooling causes a liquid, lemonade in this case, to turn to a solid. According to the state’s published curriculum, a student who has a full understanding of that learning standard or performance expectation will not necessarily be able to choose between answer A and answer C. Choosing between A and C, while it seems simple to you and me, is technically beyond the level of the state’s published curriculum.

(It has been pointed out to me that the freezing point of lemonade isn’t technically 0°C, so the question above may have other issues. But even if 0°C had been changed to –2°C or something lower, the question still requires students to know something the state has said teachers aren’t required to teach, and that, really, was the only point I was trying to make.)

So, when the US Education Department publishes a report that says assessments being developed by the Partnership for Assessment of Readiness for College and Careers (PARCC) are essentially on track, as the department did yesterday (PDF), I would remind you to remain cautious. The devil is in the details, and the details are the test items themselves. Just because a consortium or state department of education says the test is “aligned” to the curriculum doesn’t make it so.

As we have reported on numerous occasions, if the questions aren’t testing the standards the state says they measure, then the test on the whole can’t possibly measure how well teachers are teaching that curriculum to our students. I wonder if we’ll be allowed to examine the questions on tests used for teacher evaluations.

2nd problem: the statistical complexities

The remaining “big” issues are a little harder to understand, mainly because they require so much knowledge of mathematics that they come close to the formulas for derivatives that crooks on Wall Street used to swindle investors out of billions of dollars.

When I posted hypothetical student scores above, it was simple: one number represents how well each student is performing in reading, compared to grade level. By adding 10 points or one year to the beginning number, I obtained an expected value for each student at the end of the year. That is, a student who started out at 80 would be expected to be at 90 by the end of the year. If that happened, the teacher did an adequate job with that student, assuming the test was perfect.
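That simple expectation (start score plus 10 points) can be written out as a short sketch. It uses only the hypothetical scores from Mrs Becker’s class and the made-up 10-points-per-year rule; real value-added systems use far more elaborate statistical models, and the names `residual` and `EXPECTED_YEARLY_GAIN` are invented here for illustration.

```python
# Hypothetical (start, end) scores from Mrs Becker's class, as listed earlier.
scores = {
    "Ashley": (80, 83),
    "Bill": (60, 70),
    "Cory": (60, 90),
    "Daniel": (70, 82),
    "Edwina": (30, 60),
    "Frank": (115, 112),
    "Gabrielle": (75, 65),
    "Jonathon": (70, 95),
    "Kelly": (40, 110),
    "Lindsay": (80, 100),
}

# Assumption from the example: one year of teaching should add 10 points.
EXPECTED_YEARLY_GAIN = 10


def residual(start, end):
    """Actual end-of-year score minus the expected one (start + 10)."""
    return end - (start + EXPECTED_YEARLY_GAIN)


residuals = {name: residual(s, e) for name, (s, e) in scores.items()}
met_expectation = [name for name, r in residuals.items() if r >= 0]
mean_residual = sum(residuals.values()) / len(residuals)

for name, r in residuals.items():
    print(f"{name:<10} {r:+d} points versus expectation")
print(f"{len(met_expectation)} of {len(scores)} students met or beat expectation")
print(f"Average difference from expectation: {mean_residual:+.1f} points")
```

Under these assumptions, seven of the ten students meet or beat the expected score, and the class average lands 8.7 points above expectation.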

The District of Columbia Public Schools uses this idea of “expectation” in its new teacher evaluation system, according to the district’s website:

First, we calculate how a teacher’s students are likely to perform, on average, on our standardized assessment (the DC CAS) given their previous year’s scores and other relevant information. We then compare that likely score with the students’ actual average score. Teachers with high [value-added] scores are those whose students’ actual performance exceeds their likely performance.

First question: What exactly is the “other relevant information” and where does that data come from? Does it include socioeconomic status, race, or what? We saw recently in Alabama that different settings were used in terms of expected gains for African-American students, compared to white or Asian students, as education historian Diane Ravitch reported on her blog.

Although the same tactic is used in other states, the whole idea strikes me as unequal treatment of our citizens based on race. We should instead focus on equality of opportunity for all our students, regardless of their race, ethnicity, socioeconomic status, etc. But since that doesn’t seem to be how it’s done in practice, let’s focus on the practice itself.

In practice, complex statistical formulas have to be used in order to take all these extraneous factors into account and determine the correct weight to give each factor in those formulas. This is statistical magic, and it has the effect of producing seemingly valid answers from mountains of data. I’m not sure the answers are really in there, because I don’t understand the manipulations being done to the data, but statisticians say the answers are good enough.

And that raises yet another question: How good is good enough?

A massive 2010 report issued by the US Education Department entitled “Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains” (PDF) says that, based on the error rate and using three years of data, there’s about a one-in-four chance that a teacher who is “average” would be identified as “significantly worse than average.” That teacher may then be fired, even though firing her would mean losing a teacher who is performing near the middle of the road.

There are, of course, many reasons why it’s so hard to infer a teacher’s quality from value-added scores, but my point is simply that statistical wizardry, which reports a single number that is supposed to represent a teacher’s value-added score, buries the needles we’re looking for so deep in the haystack that we can’t find any of them. We’re looking for teachers we need to work with and those who need to be fired. If we accept a one-in-four chance that we’ll get it wrong, we will waste lots of valuable resources on teacher remediation and we’ll miss opportunities to identify potential mentor teachers.

I think that’s too high a probability of going down a dark alley of teacher evaluation, and I wish we weren’t taking this road.

3rd problem: complexities not even the complexities can measure

Even with all the statistical wizardry referenced above, we still can’t account for a multitude of situations in real-world classrooms, such as the fact that no kid in America is taught by only one teacher. Often, reading specialists come in to work with some, but not all, of the elementary students in any one teacher’s classroom, as do math and science specialists, guests who may inspire a kid to relate mathematics to his real-life experiences, and so on.

During any given year covered by a value-added score, many more individuals than just one teacher drive the education of each child. And that’s how schools are designed. Not every teacher can get through to every kid, but chances are, another teacher will, so the kid wins.

In fact, I would suggest schools retain teachers who rarely get through to their average students, because these are often the teachers who have a gift of working with special-needs students or those who excel at a rate the rest of their classes can’t match. Their value-added scores may not be high, but these teachers may show promise in terms of getting through to students glossed over by teachers with better average value-added scores.

4th problem: unreliability of value-added scores

Since we first computed value-added model scores, we have noticed that scores fluctuate from one year to the next. If they go up, it would seem to indicate that teachers got better; if they go down, the opposite conclusion might be indicated. But that assumes the tests measure teaching quality, which they don’t, as shown above.

Here, I turn your attention to the fact that wide fluctuations are not expected to occur from one year to the next. Because they do, people have a right to challenge how much they can rely on or count on these scores to identify good or bad teachers.

According to a briefing by researchers at Stanford, Arizona State, and the University of California, Berkeley, between 19 and 41 percent of teachers show a change of more than three deciles from year to year, and 74 to 93 percent show a change of more than one decile. The three-decile change should concern those who tout value-added scores, because it is highly unlikely that a teacher actually got better or worse than 30 percent of teachers in the system from one year to the next.

That’s like saying that in a system like Chicago Public Schools, with 23,290 teachers, about a third of them (roughly 7,700) saw their ranking among all the other teachers change by more than 30 percentile points. This is significant, because if we are looking to fire, say, the lowest 10 percent of the teachers, a teacher who is in the 40th percentile, which is close to the median, may drop down to the 10th percentile the very next year and be cut.
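For a sense of scale, the headcounts implied by those percentages can be worked out directly. This is a back-of-the-envelope sketch: the 19 and 41 percent figures come from the briefing cited above, the 23,290 figure is the Chicago teacher count used in this article, and applying the briefing’s range to Chicago is purely an illustrative assumption, not something the researchers measured there.

```python
# Teacher count used in this article for Chicago Public Schools.
TEACHERS = 23_290

# Share of teachers whose ranking moved more than three deciles from one
# year to the next, per the briefing's reported range (19 to 41 percent).
LOW_SHARE, HIGH_SHARE = 0.19, 0.41

low_count = round(TEACHERS * LOW_SHARE)
high_count = round(TEACHERS * HIGH_SHARE)
one_third = round(TEACHERS / 3)  # the "about a third" used in the text

print(f"Low end of range:  {low_count:,} teachers")
print(f"High end of range: {high_count:,} teachers")
print(f"About a third:     {one_third:,} teachers")
```

The “about a third” figure falls inside the range implied by the briefing, which runs from roughly 4,400 to 9,500 teachers under these assumptions.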

And here’s an even bigger problem: Some systems are insisting that a teacher’s performance be consistently poor before the teacher is dismissed. That would seem to balance out the problem just explained, but in fact, it only makes it worse. If the scores are so unstable that a third of them are likely to be off by as much as 30 percentiles, it’s just as likely that the variation happens in all three years for which data are collected as in one of those years.

It’s a little like the lottery. Buying multiple tickets doesn’t appreciably increase your odds of winning. If the value-added scores have a certain probability of being unreliable, they have just about that same probability of being unreliable three years in a row, just as the winning numbers have about the same chance of being wrong for the first ticket you buy as they do for the second and subsequent tickets.

In order to change the odds appreciably, you have to buy a whole lot of lottery tickets, and our kids just don’t have that kind of time for us to play with the numbers. They’re in school now. They’re getting their education now, from the teachers who are standing in front of their classrooms. There’s no time to waste, and waiting for statistical analyses to come back from test-design companies takes too long.

5th problem: statistical noise at both ends

Finally, much has been said of value-added scores for low-achieving students and for high-achieving students. Let’s start at the top. High-achieving students, say those like Frank in Mrs Becker’s hypothetical classroom above, are handicapped by value-added scoring models because teachers are told by their districts to get kids reading at grade level. Those who start close to the finish line tend to slow down so the year doesn’t become a total bore.

As a result, a fourth-grade teacher who starts the year off with mostly high-achieving students won’t realize as great a gain in her value-added scores as a teacher who starts off with a class reading at about one year below grade level. This artifact in the statistics occurs because of the teacher’s job description. She has been handed the state’s (or district’s) fourth-grade reading curriculum, with its suggested passages, test items, and so on, and told to work with it.

Many teachers will take their marching orders from the state or district as the floor of what they teach their students, but reformers aren’t training them to do this as much as they did in the past. Reformers are pushing the Common Core, equal outcomes for all students. This is a step in the wrong direction in that equality should be expressed in terms of the opportunities provided to each child, not in terms of the outcome to be expected.

At the low-achieving end of the spectrum, teachers who have many English language learners in their classes or several students with special needs spend an above-average amount of time accommodating those students and a below-average amount of time teaching the state’s curriculum or helping students meet grade-level performance expectations.

And that’s what they’re supposed to do. The problem happens because students aren’t exactly randomly assigned to teachers. In fact, many high school math teachers insist on teaching only honors-level or AP classes because they consider themselves more knowledgeable in the content than other teachers at the school. These teachers will get lots of high-end students, so their value-added scores may not be so high, since state tests aren’t designed to discriminate between students at such a high level.

Even though value-added models try to take these student differences into account, they can’t predict a student’s ability level very well, since that’s the trait they’re supposed to measure. The value-added scoring would become more of a self-fulfilling prophecy than an objective measure if too much energy were invested in the predictions. In other words, the schools would spend more money to predict how students should perform than they would on listening to and analyzing how students are actually performing and where additional resources might help.

We close with a reference to a report from the National Academies:

A student’s scores may be affected by many factors other than a teacher — his or her motivation, for example, or the amount of parental support — and value-added techniques have not yet found a good way to account for these other elements.

Until we find a way to eliminate the statistical noise introduced by these “other factors,” we strongly recommend scrapping value-added methods. Publishing the reports in a newspaper, especially a good one, reduces the credibility of that news organization a little, and although we support providing the public with all good data about our schools (“all the news that’s fit to print,” as the New York Times puts on the front page of its print editions), this information does not fall in that category and should not be printed with the imprimatur of a news organization.