Education Week has published an inaccurate characterization of the standardized-test-scoring process and should retract the blog post entitled “Computerized Grading: Purloining the Analysis, the Most Fundamental Exposition of Humanity,” written by Anthony Cody and published on May 3.
The article states: “The study compared computer grading to human scoring, but the humans doing the scoring were not, in fact, teachers. They were mostly temporary employees paid low wages, working in mass scoring facilities, as former testing company employee Todd Farley pointed out. And under these conditions, human scoring is nothing to brag about.”
Fact: Not one single “human” who scores any standardized test given by Maryland is paid a low wage. Some pull in north of $20 an hour, depending on the quality of their work, and no one makes less than $12 an hour. A great many of them, though certainly not the majority, are retired teachers. The implication of the phrase “humans … were not, in fact, teachers” is that we should pull teachers out of their classrooms to score thousands of very similar essays, which is preposterous. Finally, Mr. Cody misses the point that many good jobs in today’s economy, including those of the president and members of Congress, are temp jobs.
The article states: “We cannot afford to pay humans — even low wage ones working in hot warehouses somewhere — to score millions of essays.”
Fact: Not one single “human” who scores any standardized test given by Maryland works in “hot warehouses somewhere.” I am right now monitoring the scoring on the Maryland School Assessment in science, and the office space used by scorers is large indeed—hundreds of scorers have been hired—but it is carpeted, nicely air-conditioned, and very office-like. In fact, the working conditions here are better than those in my cubicle back at the Maryland State Department of Education.
This type of misstatement has no place at Education Week, and the journal should issue a retraction.
A little about automated (computerized) scoring
Maryland uses artificial intelligence, or automated scoring, to assign one of the two or more scores given to each response on a few questions on the MSA science test. It doesn’t work for all of the questions, and we are continuously evaluating why it works for some and not for others.
However, from the scoring floor, I can tell you that it has nothing to do with students trying to figure out how to game the system, as Mr. Cody alleges. The fact is, at least one person reads every single essay written by a Maryland student, which keeps the computer in check.
In one sense, AI scoring is complicated, but if you look at it as a slightly transparent black box, it’s not hard to understand. Writing is actually easier for AI to score: it knows which words and ideas are connected, and if an essay strings a bunch of those together, it considers the argument well supported. In science, however, where we do use AI scoring, it’s not so easy, because there are right and wrong answers, and the computer doesn’t try to read textbooks. As a result, certain types of responses send it south for the winter, never to return.
The computer can score responses much more quickly than a person, but that’s the only real benefit of using a computer to score written responses. People, on the other hand, present several advantages over AI scoring.
For example, if a student writes, “Scientists an seam confused,” when he meant to say, “Scientists can seem confused,” the computer will probably get confused. Since both “an” and “seam” are perfectly good words, the computer will make no attempt to correct the spelling, and it will miss the meaning the student intended. On the MSA in science, the computer has been programmed to issue no score when it gets confused and to send that essay to an expert scorer for evaluation, so it essentially knows when to give up.
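To make that “know when to give up” behavior concrete, here is a minimal sketch of the routing logic in Python. Everything in it is invented for illustration: score_with_confidence is a hypothetical stand-in for the real engine, and the 0.80 cutoff is an assumed value, since the MSA’s actual scoring system is proprietary.

```python
from typing import Optional, Tuple

CONFIDENCE_THRESHOLD = 0.80  # illustrative cutoff, not the real value

def score_with_confidence(essay: str) -> Tuple[int, float]:
    """Hypothetical stand-in for the scoring engine: returns a score
    and the engine's confidence in that score."""
    # Toy heuristic: the more of the response it "recognizes," the more
    # confident it is; "Scientists an seam confused" scores low here.
    words = essay.lower().split()
    known = {"scientists", "can", "seem", "confused", "because", "data"}
    coverage = sum(w in known for w in words) / max(len(words), 1)
    return (2 if coverage > 0.5 else 1), coverage

def route_response(essay: str) -> Optional[int]:
    """Issue a score only when confident; otherwise send to a person."""
    score, confidence = score_with_confidence(essay)
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # no score issued; essay goes to an expert scorer
    return score

print(route_response("Scientists an seam confused"))              # None -> human
print(route_response("Scientists can seem confused because data"))  # scored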
Trained scorers, after seeing thousands of third-grade essays, know that kids misspell words all the time. (Sometimes I think I have lost my once-prized ability to spell, given how many student essays I have read in my career.) They read right through the numerous “misspellings” and missing punctuation marks and issue an appropriate score based on their training.
Also, some very smart people, who received genius grants from the MacArthur Foundation back in the 1990s, told me flat-out that computers would never be able to assess the wide diversity of human writing. And these were people who got their genius awards for their work in artificial intelligence. They were right, of course. Even today, computers can’t evaluate the full range of human writing, but tests don’t have to. Certainly the essay in Mr. Cody’s article, which I think was generated by a computer, would never be written by a student on a test in response to any prompt I’ve ever seen.
Student writing in response to specific questions, with text passages that pretty much direct students to use language from the passage, is a completely different task for a computer scorer. The AI computer “learns” what score to give on the basis of real student responses. That’s why it sometimes doesn’t work, but it’s also why it’s better than human scorers in some respects (sometimes more accurate, not just faster). For example, if several of the high-scoring responses fed to the computer discuss one particular quote from the passage, the computer associates that quote with high scores. It gets “wowed” when a student writes about that particular section of the passage, and it starts pushing up the score, just as a good scorer would be trained to do.
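Here is a toy illustration of that association effect, using scikit-learn as a stand-in (my choice of library; the real engine’s internals are not public). The handful of “pre-scored” responses and their scores are invented, but they show how a model trained on high-scoring essays that quote the passage comes to reward the quote itself.

```python
# Toy sketch: a model trained on pre-scored responses learns to
# associate a particular passage quote with high scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

responses = [
    "The author says 'the bay is a living system' and explains why.",
    "As the passage states, 'the bay is a living system', so crabbing hurts it.",
    "Crabs are good to eat and I like them.",
    "The bay has water in it.",
]
scores = [1, 1, 0, 0]  # invented "state-approved" scores: 1 = high, 0 = low

vec = TfidfVectorizer(ngram_range=(1, 3))
X = vec.fit_transform(responses)
model = LogisticRegression().fit(X, scores)

# A new essay that leans on the same quote gets its score pushed up,
# whether or not the surrounding argument actually holds together.
new = vec.transform(["I think 'the bay is a living system' proves crabs are tasty."])
print(model.predict_proba(new)[0, 1])  # probability of a high score
```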
Of course, that’s also a downside of AI scoring: not every response that brings in material from a certain part of a passage deserves credit. If a student uses that section to support an irrelevant idea, the computer will get confused, especially when the unsupported idea sits nowhere near the quoted passage text in the essay. Real scorers can be trained to mentally reorganize an essay before scoring it, but all the computer knows, generally, is that kids tend to support their ideas near the point in their essays where they state them.
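A rough sketch of that proximity heuristic, under my own assumptions about how such a feature might be computed (the real feature set is not public): evidence gets credited as “support” only when it appears close to the claim in the essay.

```python
import re

def _tokens(text: str) -> list:
    """Lowercase word tokens, punctuation stripped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def _first_index(words: list, phrase: str):
    """Position of the first occurrence of phrase, or None."""
    target = _tokens(phrase)
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return i
    return None

def proximity_support(essay: str, claim: str, evidence: str, window: int = 15) -> bool:
    """Credit evidence as supporting a claim only if the two appear
    within `window` words of each other in the essay."""
    words = _tokens(essay)
    ci = _first_index(words, claim)
    ei = _first_index(words, evidence)
    if ci is None or ei is None:
        return False
    return abs(ci - ei) <= window

near = "The bay is a living system, so we should stop crabbing for a while."
far = ("We should stop crabbing. " + "Crabs have ten legs. " * 10 +
       "The bay is a living system.")
print(proximity_support(near, "stop crabbing", "living system"))  # True
print(proximity_support(far, "stop crabbing", "living system"))   # False
```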
Another problem with computer scoring is simply how kids write, which can make it impossible for the AI computer to tell whether the student is addressing one category of the response or another. For example, a science question may ask students to identify positive results and negative results of a three-year moratorium on crabbing in the Chesapeake Bay. Think about how kids respond to this. They usually don’t write, “Positive: crab population recovers, bay ecosystem restored to balance. Negative: no crabs, no money for watermen.” Something like that would be easy for the computer to score.
They usually write something like this: “There are many positive and negative results from not crabbing for 3 years. The cab population recovers, and that would restore balance to the bay ecosystem. Although, we wouldn’t have crabs for 3 years, and I like crabs, so that would also cut into profits of watermen on the bay.”
That might trip up the computer, which is looking for explicit identification of positive results and negative results. A person would just know that not being able to eat crabs out of the Chesapeake is a “negative” result of banning crabbing, even though the student never labels it as such. The computer gets confused by the way kids write, because on a timed test they’re usually writing everything they know about a subject as it occurs to them, not labeling everything for a computer to read.
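To see why explicit labels matter so much to a machine, consider this deliberately naive extractor, invented purely for illustration. It finds the categories only when the student writes “Positive:” and “Negative:” outright, and it comes up empty on the freeform version.

```python
import re

def extract_labeled(response: str) -> dict:
    """Naive category extraction that only works when students label
    their answers 'Positive:' and 'Negative:' explicitly."""
    out = {}
    for label in ("positive", "negative"):
        m = re.search(label + r"\s*:\s*([^.]+)", response, re.IGNORECASE)
        out[label] = m.group(1).strip() if m else None
    return out

labeled = ("Positive: crab population recovers. "
           "Negative: no crabs, no money for watermen.")
freeform = ("There are many positive and negative results from not crabbing "
            "for 3 years. The crab population recovers, and that would restore "
            "balance to the bay ecosystem.")

print(extract_labeled(labeled))   # both categories found
print(extract_labeled(freeform))  # no labels, so both come back as None
```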
Most AI scoring takes many parameters into account from each written essay. In most cases, the computer has stored a bank of responses, written by real students, to which state-approved scores have been assigned. The computer learns from these pre-scored papers, and, based on hundreds of parameters, whichever pre-scored responses are closest to the student’s essay dictate the score it receives.
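As a closing sketch, here is roughly what that nearest-neighbor scheme looks like, again with scikit-learn and invented anchor responses standing in for the real system, whose features and distance measure are not public. A new essay simply inherits the state-approved score of the pre-scored response it most resembles.

```python
# Nearest-neighbor sketch: pre-scored "anchor" responses carry
# state-approved scores; a new essay gets its closest anchor's score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

anchors = [
    "The moratorium lets the crab population recover and rebalances the bay.",
    "Watermen lose money and we cannot eat crabs for three years.",
    "Crabs live in the bay.",
]
approved_scores = [2, 2, 0]  # invented state-approved scores

vec = TfidfVectorizer()
X = vec.fit_transform(anchors)
knn = KNeighborsClassifier(n_neighbors=1).fit(X, approved_scores)

new_essay = ["The crab population recovers and the bay gets back in balance."]
print(knn.predict(vec.transform(new_essay)))  # score of the nearest anchor
```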
