Re: #9) PARCC field test will improve test quality

Paul Katula

12 years ago

The Maryland State Department of Education issued a document today advising parents in the state what they should know about testing in Maryland. This post is my personal, unsolicited, and non-endorsed response to the ninth “thing” parents should know: “Information from the PARCC field test will be used to refine and improve the assessments before they are given to all students next year. Test-makers will specifically look at whether test questions actually measure the knowledge and skills that they were designed to measure and if the questions are valid, reliable, and fair. The field test will also provide an opportunity for students, schools, and local school systems to become familiar with the new assessments before they are fully implemented in school year 2014-15.”

There’s a lot to look at here. Yes, Pearson, ETS, and the other vendors and subcontractors of the Partnership for Assessment of Readiness for College and Careers (PARCC) will use the field test to determine which test questions produce the most “valid, reliable, and fair” measure of students. Let’s look at each of these in turn:

Valid

In order for a test to be “valid,” it must measure what it purports to measure. The PARCC test purports to measure the learning standards (“knowledge and skill,” as MSDE puts it) in the Common Core State Standards, which were adopted by Maryland in 2010.

The standards are in mathematics and English language arts. I have shown that at least one math standard in third grade can’t be tested on a standardized test, and at least one other writer has shown that one of the high school standards in English language arts isn’t worth teaching to kids.

This brings us to a dilemma: How can a test be “valid” if it can’t possibly measure what it purports to measure and, in other cases, would measure something teachers shouldn’t teach? That is, if something can’t be tested, there’s a good chance teachers won’t teach it, because it won’t count as part of their evaluation. Then, if something can be tested but isn’t worth teaching, that means we’re teaching kids something just for the sake of a standardized test.

Oh, you bet teachers are going to teach it, because their evaluation will depend on kids learning a new trick that has no relevance to English language arts. That’s the kind of irony we get when we test kids on standards that need to be revised before we give kids a test, and then hold teachers to those substandard learning standards.

Reliable

Reliability means we would get the same result, within a margin of error, no matter how many times we repeated the test or which sample of students, out of our entire student population, we gave it to.

Well, there’s no need to worry about the last part: federal law requires that every kid in our public schools take the test every year in grades 3 through 8 and once in high school, which makes the question of determining a truly random sample moot. We can talk about revising that part of the federal law known as No Child Left Behind. (The law’s official name is the Elementary and Secondary Education Act; the original version didn’t have the massive reliance on standardized testing that was introduced under the George W. Bush administration in the revision known as No Child Left Behind.)

As for the first part of reliability, the margin of error, that’s a number that simply is whatever test designers say it is. One problem with settling on an acceptable margin of error is that these standardized tests are influenced by many variables that are mostly outside teachers’ control (socioeconomic status of their students, home life, amount of time kids have to study, kids’ active and passive vocabularies, and so on), so creating a truly reliable test, especially a standardized one, is kind of a pie-in-the-sky goal. Nobody expects these tests to be truly reliable, simply because there are just too many variables.

Test designers try to eliminate variables, but their efforts are limited to the test-taking environment. They provide word-for-word scripts that test proctors are supposed to say to kids during the testing, instructions that no kid is to be provided with any assistance in answering a question during the test, lists of tools and aids teachers can provide for kids during the test, including posters on the wall in the test room, etc. But all of this “standardization” of the testing environment only goes so far. It can’t touch many of the more important variables that influence kids’ performance on the test.

Think of it like this: A lawyer has to pass the bar exam to practice law in a courtroom. The bar exam is reliable mainly because it’s comprehensive. With so many questions, any random sample of them is likely to produce about the same score for a given person as any other random sample, simply because the sample is so large. It’s like taking a poll before an election: the more people you ask, the more certain you can be that your estimate of how the entire population will vote will be close.
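The polling analogy can be put into numbers with a rough sketch. It assumes, purely for illustration, that a test-taker answers every question independently and gets each one right with the same probability; real questions vary in difficulty, so this is a simplification. The function name and the test lengths below are hypothetical, not taken from any actual exam.

```python
import math


def score_standard_error(p_correct: float, num_questions: int) -> float:
    """Standard error of the fraction-correct score.

    Assumes (hypothetically) that every question is answered independently
    and correctly with the same probability p_correct.
    """
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    return math.sqrt(p_correct * (1.0 - p_correct) / num_questions)


# Illustrative test lengths, not the lengths of any actual exam.
for n in (8, 50, 200):
    se = score_standard_error(0.7, n)
    print(f"{n:>3} questions: score = 70% +/- {100 * se:.1f} points")
```

Under these assumptions, the uncertainty around a 70 percent score shrinks from roughly 16 percentage points at 8 questions to roughly 3 at 200, which is the same reason a bigger poll gives a tighter estimate.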

We can’t do that on a test for third graders, because asking too many questions would just take too long and kids wouldn’t be able to sit still in a controlled testing environment long enough. That would completely ruin any hope of controlling other variables that are under teachers’ control. In New York last year, on a test aligned to the Common Core, one of the chief complaints was that the test was too long and many kids were unable to finish it. There’s that reliability attempt rearing its ugly head.

So, we have to pick just a few questions for each of the major areas in the standards. In third grade, for instance, any one student may get six to nine questions about math “operations and algebraic thinking.” A different third grader will get a different set of six to nine questions. Reliability means that if both of these kids have the same level of mastery in math operations and algebraic thinking, they’ll get the same score on that subset of six to nine different questions.

That’ll never happen.

The reason this goal is so unreachable lies somewhere in the variables that affect kids’ scores on standardized tests. So, MSDE can call the test “reliable,” but we have to keep in mind that it’s only reliable within the limits of testing 9-year-olds.
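Even if we ignore all of those outside variables, chance alone makes identical scores unlikely on so few questions. Here is a minimal sketch, assuming two students who each answer every question independently with the same 70 percent chance of getting it right. That figure and the function names are made up for illustration, and the model leaves out everything about home life, vocabulary, and the rest.

```python
from math import comb


def score_pmf(num_questions: int, p_correct: float) -> list[float]:
    """Binomial probabilities of getting 0..num_questions answers right."""
    return [
        comb(num_questions, k)
        * p_correct**k
        * (1.0 - p_correct) ** (num_questions - k)
        for k in range(num_questions + 1)
    ]


def prob_same_score(num_questions: int, p_correct: float) -> float:
    """Chance that two independent, equally skilled students tie."""
    return sum(p * p for p in score_pmf(num_questions, p_correct))


# Six to nine questions is the range mentioned above for one math area.
for n in (6, 9):
    print(f"{n} questions: P(same score) = {prob_same_score(n, 0.7):.2f}")
```

In this simplified model, two kids with exactly the same mastery land on the same score well under half the time.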

Fair

A test question is fair if a boy and a girl with the same level of understanding of the question’s topic are equally likely to get it right. Likewise, an African-American boy and a Hispanic girl should have an equal chance of getting it right. And so on.

There can be no biases, and the only way to determine if a question has a bias is to ask kids from different subgroups to answer it on a field test. If a higher percentage of boys than girls, say, answer a question correctly, assuming the sample size is large enough and no other variables are involved (which is asking a lot), the question is biased against girls and would be thrown out.
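The comparison described above can be sketched as a simple two-proportion check. This is a generic textbook statistic, not the procedure PARCC or its vendors actually use, and it does not match students on overall ability the way an operational bias analysis would need to. The function name, the counts, and the cutoff are all assumptions for illustration.

```python
import math


def two_proportion_z(correct_a: int, total_a: int,
                     correct_b: int, total_b: int) -> float:
    """z statistic for the difference between two groups' success rates."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both groups must contain at least one student")
    rate_a = correct_a / total_a
    rate_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if std_err == 0.0:
        return 0.0  # everyone right or everyone wrong: no detectable gap
    return (rate_a - rate_b) / std_err


# Made-up field-test counts for one question.
z = two_proportion_z(correct_a=620, total_a=1000, correct_b=580, total_b=1000)
flagged = abs(z) > 1.96  # conventional 5% two-sided cutoff, an assumption here
print(f"z = {z:.2f}, flag for review: {flagged}")
```

With these invented numbers, a four-point gap between two groups of 1,000 students falls just short of the conventional cutoff, which shows how much the verdict depends on sample size.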

The theory is that no biased question should count toward a kid’s score.

This has made for some rather dull reading passages on standardized tests, because it has been found that boys engage at higher levels with questions on certain topics, while girls engage better with questions on different topics. The result has been, in many cases, reading passages that aren’t really about any topic.

The Common Core has added social studies and science literacy, whereas many state standards didn’t have this piece before the Common Core. I believe literacy in science and history is a good thing, but the primary purpose for putting these standards into the Common Core was obviously to allow reading passages on tests that are not only fair to all students but at least a little interesting.