An economics study out of Harvard and the University of Chicago, conducted last school year, found that paying teachers bonuses for improved student performance on standardized tests can raise scores on those tests if the bonus is timed properly and tied to both risk and reward, assorted news agencies are reporting.
The study is Roland G. Fryer, Jr., Steven D. Levitt, John List, and Sally Sadoff, "Enhancing the Efficacy of Teacher Incentives through Loss Aversion: A Field Experiment" (Cambridge, MA: National Bureau of Economic Research, July 2012), available in manuscript form.
About 150 teachers at nine K-8 schools in Chicago Heights, Ill., were randomly assigned to a control group or to one of four treatment groups that differed according to whether teachers received bonuses up front or only after demonstrating gains, and whether bonuses were based on individual or team gains. Bonuses ranged up to $8,000.
Results showed that students whose teachers received up-front bonuses made statistically significant gains in math (roughly .2 to .3 standard deviations), a pattern that held whether teachers were compensated as a group or as individuals. No significant impact was detected in the group that received incentives in the traditional, pay-at-the-end fashion, and the findings were not robust in reading.
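If "gains of roughly .2 to .3 standard deviations" reads like Greek, here is what the phrase means in practice. This is a minimal sketch in Python with invented numbers, not the study's data:

```python
import numpy as np

# Invented end-of-year math scores for illustration only; the study's raw
# data are not public. One control classroom, one "up-front bonus" classroom.
control = np.array([231, 245, 252, 238, 260, 249, 243, 255])
treated = np.array([233, 247, 255, 240, 263, 251, 246, 257])

# Effect size in standard-deviation units (Cohen's d with a pooled SD):
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
d = (treated.mean() - control.mean()) / pooled_sd

print(f"effect size: {d:.2f} SD")  # ~0.25 here; the study reports ~.2 to .3
```

In plain terms, the average treated kid scored about a quarter of a bell curve's width above the average control kid. Noticeable, but not a miracle.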
News agencies are grasping at the study, looking for something to support merit pay. It's there, for sure, at least a little, but there are a few problems with the study itself, and I want to point those out to the excellent reporting staff at the Tribune and other organizations. (More on that below.)
To show you what I mean about the Tribune: an op-ed in the Chicago Tribune says the study means merit pay works, or at least has the potential to "work" if it's executed properly. The Tribune's editorial board writes that it is disappointed merit pay is apparently off the table in the ongoing and troubling contract negotiations between the Chicago Public Schools, which wanted merit pay in any new contract, and the Chicago Teachers Union, which opposes any merit pay system.
Nobody, least of all me, wants Chicago Public Schools teachers to go out on strike. There are more than 400,000 students in that district, and every one of them would be affected. We need to be pragmatic about reaching a deal: take care of the first priority, which is keeping kids in the classroom, and stop holding them hostage to our favorite programs, no matter how much research we can dig up to support our pet ideas.
Furthermore, the conservative-leaning Fordham Institute published these comments about it earlier this week: “The study puts a … flip on merit pay, investigating how the timing of monetary bonuses affects teacher performance. Instead of receiving bonuses after their students have demonstrated higher achievement, teachers in Fryer’s study were paid in advance and agreed to return the money at the end of the year if their students did not improve sufficiently.”
Yes, it does sort of turn the tables: if you give teachers some money up front and tell them they'll have to give it back if their students' test scores don't improve, the scores tend to improve, at least a little. The effect is barely significant from a statistical point of view, but it's there, at least in math. Reading, not so much, but that may be because kids typically have only one math teacher while several teachers provide reading instruction, especially in the lower grades.
But even if I wanted to, I couldn't reproduce this study, because the researchers, from otherwise respected institutions, did a shoddy job of reporting the findings. They used 150 teachers at nine schools in a single school district in a suburb of Chicago that hardly resembles the rest of the country. That's a very small sample that may or may not have several "local influence" variables wreaking havoc with the variance and adding an unknown level of bias to the results. The paper has not been subject to actual peer review, and the discussion below shows why it would fail to pass.
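To put a rough number on why nine schools worries me more than 150 teachers, here's a back-of-the-envelope sketch. The intraclass correlation below is my assumption for illustration, not a figure from the paper:

```python
# Back-of-the-envelope Kish design effect for a clustered sample.
n_teachers = 150
n_schools = 9
icc = 0.15                    # assumed: how alike teachers within one school are

m = n_teachers / n_schools    # average cluster size (~17 teachers per school)
deff = 1 + (m - 1) * icc      # design effect from clustering
effective_n = n_teachers / deff

print(f"design effect: {deff:.1f}")
print(f"effective sample size: ~{effective_n:.0f} independent teachers")
```

If teachers within a school resemble each other even modestly, those 150 teachers carry roughly the statistical information of about 45 independent ones. That's the "local influence" problem in one number.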
And now, some other problems with the study
The researchers used Discovery Education's ThinkLink Predictive Assessment, which has been shown (by the company's own psychometricians) to be a reliable predictor of results on the Illinois Standards Achievement Test (ISAT), used in grades 3 through 8, and the Iowa Test of Basic Skills, used in grades K through 2. It's not the same test as the ISAT, in other words, but it does a good job of predicting students' scores on the ISAT.
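"Does a good job of predicting" has a precise, checkable meaning: correlate fall ThinkLink scores with spring ISAT scores. A sketch with hypothetical numbers, since the real ones have never been released:

```python
import numpy as np

# Hypothetical score pairs; we do NOT have these numbers. This only shows
# what an independent predictive-validity check would look like.
thinklink = np.array([182, 175, 190, 168, 201, 177, 185, 195, 171, 188])  # fall
isat      = np.array([235, 229, 246, 218, 257, 230, 241, 249, 226, 240])  # spring

r = np.corrcoef(thinklink, isat)[0, 1]
print(f"predictive validity: r = {r:.2f}")
```

One line of arithmetic, if you had the data.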
The problem here, of course, is that ThinkLink's data are not available for public inspection. Discovery Education is a private company, and it has absolutely no obligation to respond to requests for records. For all we know, independent writers like KentuckyLiteracy.org, who generally support the ThinkLink Predictive Assessment, may rely solely on information provided by Discovery Education. It could all be one big ride on the company's PR roller coaster. In fact, in its final report, KentuckyLiteracy.org quotes Discovery Education's own website:
Based on scientific design principles that ensure reliability and validity, ThinkLink applies the Continuous Improvement Model to its own product development. Using content research, items matched specifically to each state’s standards and evidence-based analysis, revision continues at every stage of program development.
This is an ongoing process; each year ThinkLink performs extensive psychometric analyses to ensure teachers receive the most accurate and reliable, state specific feedback possible. …
But we simply don’t know, and the source of this report is potentially biased. However, the company is assumed innocent, and the test is assumed to be reliable and valid as an assessment instrument, aligned to the Illinois Learning Standards, which is what the high-stakes tests our students face supposedly test. We just don’t know.
But here’s the rub: The research report itself never once mentions the word “reliability,” “validity,” or even “reliable” or “valid.” Doesn’t it strike you as a gross misunderstanding of assessment instruments that researchers from great institutions like Harvard, the University of Chicago, and the University of California would do a whole study that relies on the results from a single assessment instrument and never, anywhere in the report, tell us what the reliability and validity of those assessment instruments are? It would just take one line in the 34-page report.
We might be able to find out by conducting another investigation with our own money, but that would take far too long for what it would be worth, and by then the Chicago Tribune and the Fordham Institute would already have sold the public a bill of goods. We would have already drunk the Kool-Aid.
Please understand: I'm not saying the ThinkLink Predictive Assessment isn't valid or reliable. I'm simply saying I don't know, because the researchers didn't report it. Good scientists would have included those statistics in their published report, and the fact that they are omitted makes me wonder if the researchers know something they're not telling me.
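And to be clear about what I'm asking for: "reliability" is not a philosophical judgment, it's a routine statistic. Here's a minimal sketch of one common flavor, Cronbach's alpha, computed on invented item responses, since the real item data are not public:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha; rows are students, columns are test items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of students' totals
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Invented right/wrong responses: 60 students, 12 items, driven by one latent
# ability so the items hang together (illustration only).
rng = np.random.default_rng(0)
ability = rng.normal(size=(60, 1))
responses = (ability + rng.normal(scale=0.8, size=(60, 12)) > 0).astype(float)

print(f"alpha = {cronbach_alpha(responses):.2f}")
```

One number like that, with a sentence about how it was obtained, is the missing line in the 34-page report.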
Also, I’m not saying researchers didn’t use the ThinkLink Predictive Assessment as it was intended to be used. But again, they don’t report it. How many questions were used? for example. Even if the assessment as a whole is a valid and reliable predictor of student performance on the ISAT, did these researchers use it that way for this study? I’m sure they did, and everything, because scientists always do all experiments properly and in good faith, but it is glaringly omitted from the manuscript submitted for publication in July.
Results reported to mirror ISAT results
Researchers say that “results on the state tests … mirror the ThinkLink results.” And they do. Check out Table 4 in the appendix, and compare it to Table 3. Sure enough, ThinkLink results appear to mirror the ISAT results. But in the caption to these tables, we find this interesting note:
All regressions include dummy variables for each student's school and grade as well as baseline test scores. Columns (2), (3), (5), and (6) add controls for students' race, gender, age, free-lunch status, English proficiency, and special education status.
That's statistics talk for the following: the numbers in these tables aren't any actual kid's score. They're model-adjusted estimates, built around a statistical composite of a "kid" in these schools defined by whether he's black, male, poor, in special ed, or can't speak a word of English. Statisticians love to do this kind of thing and then hope nobody reads the fine print.
Which is fine, usually, since most of us don't read the fine print. But keep in mind that we're not talking about real kids here, just characters in a story, like the ones reporters used to make up in the olden days, before people found out that was a dangerous and potentially libelous practice. The statisticians haven't gotten the news yet.
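For readers who want to see the fine print in code form, the caption translates into a regression specification roughly like the one below. This is my reconstruction, with assumed column names and synthetic stand-in data; the authors' actual code and data are not published:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: one row per student, with a 0.2-SD treatment
# effect baked in. Column names are my assumptions, not the paper's.
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "school": rng.integers(1, 10, n),       # nine schools
    "grade": rng.integers(0, 9, n),         # K-8
    "treatment": rng.integers(0, 2, n),     # 1 = teacher got an up-front bonus
    "baseline_score": rng.normal(0, 1, n),
    "free_lunch": rng.integers(0, 2, n),
})
df["score"] = (0.2 * df["treatment"] + 0.7 * df["baseline_score"]
               + rng.normal(0, 1, n))

# The caption's base spec: school and grade dummies plus baseline scores.
# Columns (2), (3), (5), (6) would add race, gender, age, free-lunch status,
# English proficiency, and special-ed status the same way.
model = smf.ols("score ~ treatment + baseline_score + C(school) + C(grade)"
                " + free_lunch", data=df)
print(model.fit().params["treatment"])  # the adjusted effect the tables report
```

The point: the reported "treatment effect" is that one coefficient after everything else has been statistically held constant, not a difference you could see by lining up two columns of raw scores.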
Then let's take a look at what the report actually said. The baseline was each teacher's three prior years of ISAT results. Compared with those three prior years (each with a completely different cohort of kids, I point out again), averaged together and adjusted through the wizardry of regression coefficients for race, gender, poverty level, English proficiency, and, this is my favorite, the "school" itself, we see an improvement of about two-tenths of a standard deviation when teachers are trying to avoid losing a pre-paid bonus.
But the bottom line is not as favorable to this research
These design issues and all this statistical wizardry cast doubt on the research. And that's a problem for me, because I want to figure out whether there's anything politicians can do to make teachers do a better job of teaching kids. All the research up to this point says: probably not. Then this study comes along and says: there is, but you have to do it right.
Don't just give teachers the money at the end if their students' scores improve. Instead, threaten to take away something you already gave them if the scores don't improve. Does that strike you as mean? It's like giving a baby a piece of candy and, just after he gets it unwrapped, taking it away, while telling him he can keep it if he wears a certain coat to play outside. His old coat has kept him plenty warm so far this winter, but you have bought him a new, expensive one, and he can't have his candy unless he wears the coat you bought.
I question whether this would even work with a baby. I suppose it depends on how much he likes the new coat you bought. I know for sure it won't work with teachers.
The research study cited here was done blind. That is, teachers didn’t know what the results were being used for. After the first year of such a program, they’ll catch on. Then, good luck finding any improvement in scores.
Meanwhile, we're treating our teachers more like babies all the time. This program would be disrespectful to teachers' professional standards, which is why teachers unions oppose it and every other merit pay program that has come along so far.
Look at it this way: if a grown-up isn't sure such a scheme would work on his 4-year-old nephew, he sure isn't going to want to try it on his nephew's teacher. I have too much respect for her and her efforts to do her job to the best of her ability. That's what I really ask for: not the best teacher in the 27 most industrialized countries, but a teacher who does her best to give my nephew the education he needs to be a happy person. I don't care that much about test scores, and they're getting just a little old.
