Bias detected in Bands of America trait scoring

Tuesday, May 4, 2010

Introduction

I’m almost reluctant to say anything about this, since what some of our best educators try to do with marching band performances has absolutely nothing to do with comparing one school’s performance to that of another. Charles Staley, fine arts chair at Neuqua Valley High School in Naperville, Ill., for example, once told me marching band was a “leadership training activity. The musical experience is secondary to the team building marching band promotes. Competitions have their place in a marching band season, but naming a … champion promotes those programs that place too much of an emphasis on marching band. Administrators [at those schools] push music programs to win.” And lest you think Neuqua Valley isn’t doing a top-notch job of educating its 4,447 students, just be aware that the Fine Arts Department was awarded a Grammy in 2009.

However, our schools spend millions of taxpayer dollars on marching band and related activities, travel, fees, props and equipment, licensing fees for the music and text used in the shows, and so on. Each band student devotes hours of time and mountains of energy to his or her school’s marching band in the pursuit of performance excellence. The most publicly visible segment of the high school marching band performance circuit is a series of “festivals,” where one band performs after another, usually in 14- or 15-minute intervals, sometimes for an entire day that starts for the teenagers before the sun rises and goes until long into the night. They all name a “champion,” so the scoring of bands in an effort to name such a champion is worth talking about, at least a little.

The Bands of America organization, an Indianapolis-based nonprofit, puts on huge festivals—not only “regional” and “super-regional” competitions but also a scoring contest it pretentiously calls the “Grand Nationals.” It is certainly the biggest marching band spectacle in America. BOA—as well as every competition that licenses the BOA adjudication scoresheets, which is a high number of them, to be sure—in effect bases its ranking of bands at these competitions on scores given by judges, some of whom are down on the field, and some of whom are in the press box or at a higher vantage point.

BOA calls its system of scoring a “criteria referenced” one, which actually brings to mind “criterion-referenced” standardized tests. By using a technical term, it misleads us into believing that judges objectively measure a band’s score against a published set of criteria, known and understood by educators. If it were truly a criterion-referenced system, however, bands that meet all the criteria at the high school level would get perfect scores. And then, we wouldn’t be able to call one school a grand champion, since so many of them would likely meet any organization’s criteria at the high school level.

To use another example, there’s no champion SAT test taker. According to the College Board, 297 students got perfect scores last year on the test. The results would be similar for bands at a competition, in that several bands would fall into the same score category. That’s how criterion-referenced (or norm-referenced in the case of the SAT) scoring systems work. Only a few make it to the top level, but the scoring system doesn’t have much use when it comes to naming a single champion.

I have to conclude, then, that the BOA organization is simply misusing a scientific term in an attempt to confuse us so we support their scoring system, which in fact is nothing like any criterion-referenced system our educational establishment has ever seen. But here, I’ll try to analyze it as such, since in education, this type of system has worked fairly well.

The use of a technical term like that, out in the open, invites a scrutiny of the criteria that the guidelines purport to reference. I have space for just one such set of criteria a judge might use at a BOA competition: Music Performance (Ensemble). A judge can award up to 200 points in this category, which counts for 20 percent of the band’s final “score.” Of those 200, 50 come from the trait (i.e., scoring subcategory) of musicality, which includes expression, full range of dynamics, phrasing, and style/idiomatic interpretation.

Between 46 and 50 points should be awarded in this subcategory to a band whose performers “constantly display the highest level of control and concept of musicality,” the BOA Adjudication Handbook says. “The performers maximize the technical and artistic aspect through clear, meaningful and expressive shaping of musical passages as evident with proper and uniform expression/dynamics. There is a natural, well-defined and sensitive display of playing throughout with valid, tasteful phrasing and idiomatically correct interpretation achieved in a consistent manner.” Bands whose students do this “often” or “occasionally” instead of “constantly,” or that “lack a fundamental awareness of the musical program … the basic elements of musicality,” get lower scores in this subcategory.

Speaking as an educator who deals with student standards of learning at the state and federal levels, I see big problems with this scoring rubric. First of all, what can I or anyone else say is the “highest” level of musicality—the highest on a given day of competition, the highest I’ve ever seen or heard, or what? The use of the superlative in the scoring criteria statement itself invites comparisons between one band and another, which directly contradicts the reason bands work so hard to give an excellent performance in the first place.

Second, how does a judge know what students “understand” or what they have an “awareness” of? All a judge really knows about any student or the program in general is what he sees and hears during a single eight-minute performance. The rubric’s language bases the judge’s decision almost entirely on what students “understand,” which doesn’t necessarily correlate with what is demonstrated during the one performance.

Third, at the high end of the score distribution, we see words like “tasteful,” “meaningful,” and “clear.” Maybe a band wants to use music or include musical elements in their show that are purposefully distasteful to a majority of the population. Even though they give a high-quality performance, their score will suffer, at least if judges apply the published rubric correctly. Furthermore, a research paper about subatomic physics would be “clear” to a college professor, for instance, but would go right over the head of the average high school student. Like tastefulness, the clarity of something depends almost entirely on the qualifications and expertise of the person doing the listening, unless that something is completely garbled. In addition, a rock ‘n’ roll tune, say, might not be as meaningful to a college music professor as Paul Hindemith’s “Sonata for Trumpet and Piano” would be, yet the correct application of the rubric by that judge will again cause high-quality, hard-working programs to get lower scores on the basis of their musical selection.

Inclusion of simplistic and dumbed-down language like this in the scoring rubric for one of the finest high school musical performance organizations in the world not only is embarrassing but also reflects a lack of interest in promoting good education, good musical performance and understanding, and good character development. It turns the process into one of personal opinion by using language that renders the score a function almost entirely of the judge’s experience. It inevitably leads to errors and bad judgments on the part of judges, and the very fact that thousands and thousands of America’s students participate in these activities and millions and millions of dollars are spent each year, especially in a time when schools need more money, causes us to scrutinize critical aspects of the program.

For a moment, take a step back. Although I’m not directing this analysis at high school kids, they are fundamentally at the core of this issue. Yes, they get excited about performing in domed stadiums or big NFL stadiums, which is something BOA provides like no other organization. They get excited about marching at Memorial Stadium in Champaign, Ill., the place where a marching band took the field for the first time, in a 1907 game between the University of Chicago and the University of Illinois. Yes, they learn a lot about how to improve themselves from the comments given to them by qualified judges and educators. They grow every time someone encourages them with applause, and these festivals sometimes draw 40,000 screaming fans of all the bands. That alone probably makes the entry fees, bus rides, hotel rooms, and other expenses good and justifiable investments.

But they also talk about how they wonder why a certain band did or didn’t make the finals, whether it was unjust that one band was named grand champion at a competition while another was left in third place. These thoughts about the contest may not exactly be what educators want them to focus on, but these thoughts are a fact of every band teenager’s life. That’s what kids talk about when it comes to these marching competitions. What I’ll try to do here is answer those questions about why bands fall in a certain order in the rankings. And I’ll show how the scoresheet doesn’t reflect a “scoring” system at all but a flawed “ranking” system, through a detailed analysis of score data from the BOA Grand Nationals last November in Indianapolis.

Analysis of some data

The table below lists the 34 bands that performed on Saturday at last year’s BOA Grand National Championships. In other words, these are the bands that made the semifinal cut coming out of prelims on Thursday and Friday at Lucas Oil Stadium. The bands are shown with the percentile ranking among just these 34 bands, using the score given to them by the Music Performance (Ensemble) judge during prelims and semifinals. Note that these scores represent two different performances and scoring by two different judges. The right-hand column on the table shows the change in percentile ranking from prelims to semifinals. A positive change indicates the band was placed higher among these 34 bands in the semifinals than it was during the prelims.

Band (in semifinals performing order)	Music Ens. percentile (Prelims)	Music Ens. percentile (Semis)	Change (+/-)
Springs Valley H.S. IN	3%	0%	-3%
Williamstown H.S. KY	6%	9%	+3%
Saint James School AL	0%	3%	+3%
Reeths-Puffer H.S. MI	9%	27%	+18%
West Bloomfield H.S. MI	30%	30%	0%
Lake Central H.S. IN	15%	21%	+6%
Bellbrook H.S. OH	36%	21%	-15%
Walled Lake Central H.S. MI	39%	18%	-21%
Plymouth-Canton Ed. Park MI	58%	61%	+3%
Center Grove H.S. IN	73%	45%	-27%
Avon H.S. IN	97%	97%	0%
William Mason H.S. OH	42%	64%	+21%
Owasso H.S. OK	27%	55%	+27%
Columbus North H.S. IN	21%	36%	+15%
James Bowie H.S. TX	76%	64%	-12%
Marcus H.S. TX	100%	97%	-3%
American Fork H.S. UT	64%	48%	-15%
Lawrence Central H.S. IN	82%	79%	-3%
Carmel H.S. IN	85%	88%	+3%
Ben Davis H.S. IN	67%	36%	-30%
Broken Arrow Sr. H.S. OK	91%	70%	-21%
The Woodlands H.S. TX	70%	82%	+12%
North Hardin H.S. KY	52%	55%	+3%
Bourbon County H.S. KY	18%	52%	+33%
Marian Catholic H.S. IL	88%	91%	+3%
L.D. Bell H.S. TX	94%	76%	-18%
Centerville H.S. OH	61%	73%	+12%
Lafayette H.S. KY	55%	82%	+27%
Lake Park H.S. IL	48%	30%	-18%
Wando H.S. SC	79%	91%	+12%
West Johnston H.S. NC	12%	42%	+30%
Seminole H.S. FL	33%	6%	-27%
South Brunswick H.S. NJ	45%	15%	-30%
Forsyth Central H.S. GA	24%	12%	-12%

The majority of data shown is unremarkable. Most bands achieved about the same ranking among the 34 bands during the semifinals as they did during the prelims. However, a few anomalies in the data stand out. The scatterplot below shows this data graphically. The red line is a theoretical construct, which shows where a point would be plotted if a band didn’t increase or decrease in Music Performance (Ensemble) position from prelims to semifinals.

[Scatterplot: Semi-final Placement vs. Prelim Placement]

Consider Bourbon County High School’s Music Performance (Ensemble) score. The band’s point is plotted at the 18th percentile on the horizontal axis, which represents the band’s placement in the prelims among these 34 bands, yet on the vertical axis, which represents its placement in the semifinals, the point sits all the way up at the 52nd percentile. This indicates that Bourbon County’s score was equal to or higher than 17 of these 34 bands in semifinals yet higher than or equal to only six of them during prelims. It’s hard to believe that between their performance on Friday and their performance on Saturday, the band all of a sudden got better than 11 bands that had outscored them during prelims.
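The arithmetic behind those percentiles can be sketched in a few lines of Python. This is only a sketch: it assumes a band's percentile is the number of bands it outscored divided by the 33 other bands in the field, a rule that is consistent with the table but that I am inferring rather than quoting. The function name `percentile_from_count` is mine.

```python
def percentile_from_count(bands_outscored: int, field_size: int = 34) -> float:
    """Percentile rank as a share of the *other* bands in the field.

    Assumption: rank = bands_outscored / (field_size - 1), so the lowest
    band sits at 0% and a band that outscores all others sits at 100%.
    """
    return 100 * bands_outscored / (field_size - 1)


prelims = percentile_from_count(6)   # Bourbon County in prelims
semis = percentile_from_count(17)    # Bourbon County in semifinals

print(round(prelims), round(semis), round(semis - prelims))  # prints: 18 52 33
```

Under that assumption, taking the change before rounding would explain why the table shows +33 for Bourbon County rather than the 34 you get by subtracting the rounded 18 from the rounded 52.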

This is the kind of anomaly that drives scoring analysts nuts, and I am one such scoring analyst. The scores I analyze aren’t usually marching band scores, but I do analyze criterion-referenced standardized test scores for a state department of education. Let me just say that if the share of a school’s students scoring proficient in reading jumped by 33 percentage points from one year to the next, there would be an investigation. Actually, even before there was an investigation, the state or district would probably fire the principal and then conduct a formal investigation. The anomaly in the data would be that suspicious.

The stakes aren’t quite as high at the Grand Nationals, but the data are no less anomalous. That is, it could happen, but the odds of winning the lottery are probably better.

In order to understand Bourbon County’s sudden increase in Music Performance (Ensemble) ranking, we need to look at performance order, one of the leading causes of bias in judging, not only at Bands of America contests but on statewide standardized tests as well. During the prelims, Bourbon County performed immediately after Marian Catholic from Chicago Heights, Ill., the only seven-time Grand National champion at Bands of America. During the semifinals, when Bourbon County received a much higher placement, they performed right before Marian Catholic.

One factor in the change in Bourbon County’s placement in Music Performance (Ensemble) could be a form of rater bias known as “scoring sequence bias.” For many scoring analysts, this is not a form of bias in itself but rather a side effect of what is known as “central tendency bias.” This occurs when a rater, say, of your essay questions on a statewide standardized exam, gets a class of papers where only one student writes anything close to the correct answer. As he reads papers that are completely off-base, he keeps thinking to himself that they couldn’t possibly be that bad and starts thinking they are closer to the center of the score continuum than they really are. Then, upon reading the only paper in the bunch that has a semblance of a clue, he tends to give it a higher score than it really deserves, simply because of the essays he read before it which he thought had to be closer to the middle—in other words, better—than they really were.

Central tendency bias works both ways, too. If a series of good bands performs, a judge could tend to think they couldn’t be all that good and start moving his expectations toward the middle. If a middle-of-the-road band then comes along, the band is likely to get a score below the median because the scorer’s opinion of what an “average” performance is has been significantly elevated by his thinking that all the preceding good bands were really closer to the center of the scoring scale than they really were.

Depending on the band’s order in the performance, since judging marching bands is subject to the same human nature as scoring essay questions on standardized tests, we could see effects from the bands that came before it. This gives rise to the common expression, “They’re a tough act to follow.”
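To make the "tough act to follow" mechanism concrete, here is a toy model in Python. Everything in it is an assumption made for illustration: the 0-to-100 quality scale, the `drift` and `pull` parameters, and the idea that a judge's sense of "average" moves partway toward each band just heard. It is not fitted to any BOA data and is not how any judge is known to behave.

```python
def judged_scores(true_quality, drift=0.5, pull=0.4, midpoint=50.0):
    """Toy model of a judge whose sense of "average" drifts with what was just heard.

    Illustrative assumptions, not estimates from real scores:
      drift - how far the judge's reference level moves toward each band heard
      pull  - how strongly a shifted reference level distorts the next score
    """
    reference = midpoint  # what the judge currently hears as "average"
    scores = []
    for quality in true_quality:
        # A reference above the midpoint makes this band sound worse, and vice versa.
        scores.append(quality - pull * (reference - midpoint))
        # After scoring, the reference moves partway toward the band just heard.
        reference += drift * (quality - reference)
    return scores


# The same middle-of-the-road band (true quality 50) in two different slots.
after_strong = judged_scores([90, 90, 90, 50])[-1]
after_weak = judged_scores([20, 20, 20, 50])[-1]
print(after_strong, after_weak)
```

In this sketch the identical performance is scored below 50 when it follows three strong bands and above 50 when it follows three weak ones. The strong bands themselves also drift downward as the judge's expectations rise, which matches the "they couldn't be all that good" pattern described above.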

Consider L.D. Bell High School, a former Grand National champion that was placed second last year. The L.D. Bell band performed immediately after Marian Catholic in the semifinals and had a change in placement of 18 percentile points in the negative direction, compared to their placement during the prelims among these same 34 bands.

This is not to dispute the naming of Avon High School as last year’s Grand National champion, but it leads to questions about a “Marian Catholic” effect, where the band that performs immediately after Marian Catholic will tend to get a lower score in Music Performance (Ensemble) than they would if they had given exactly the same performance but followed a band that was not as superior as Marian Catholic in terms of musicality. The data suggest that this is not a “criterion-referenced” system at all but rather an “other-band-referenced” system.

Of the eight bands that performed before L.D. Bell in the prelims, for example, only one even made the semifinals. The other seven were completely out of the running. As with a decent essay after a scorer sees a whole classroom’s worth of bad writing, the Music Performance (Ensemble) score might have been higher than L.D. Bell deserved during prelims, or because of the Marian Catholic effect, their score might have been lower than they deserved during semifinals. The anomaly in the data, to which I limit my analysis here, is simply that L.D. Bell’s placement among these 34 bands was significantly different during prelims and semifinals.

Analysis of other points in the data reveals similar bias trends. For example, the prelims judge was not as impressed with the music of West Johnston High School from Benson, N.C., as the semifinals judge was. A possible influence on the prelims judge was that the band from Marcus High School in Flower Mound, Texas, was coming off the field just as West Johnston was walking on. Marcus was a tough act to follow, I’m sure, based on the fact that the band got the top score for music at the Grand National Championships among all bands. During semifinals, on the other hand, Avon, Marcus, Marian Catholic, and L.D. Bell were all long out of the mind of the Music Performance (Ensemble) judge when West Johnston took the field.

In addition, data for other BOA traits show similar biases. Some of the analysis can be found by following the links below to our direct stories covering the bands that performed in the semifinals at last year’s Bands of America Grand National Championships. I’ve also made the Excel spreadsheet I used to conduct this preliminary analysis available online, here. Simply save the target of this link and open it with Microsoft Excel 2007 or higher (it’s an XLSX file).

Reducing the likelihood of scoring bias

A truly criterion-referenced “scoring” system would provide exemplar performances for each score point. Take Music Performance (Ensemble), for example. A true scoring system would provide examples at least of performances that should receive a musicality subcategory score of 46, 36, 16, and 6 (the critical lines on the rubric). Then, judges would be trained using these performances, which illustrate the key lines in the rubric, not using a rubric itself that makes an effective mockery of criterion-referenced systems. If such a system could be established, it would ensure, as much as possible, that a judge would know what a band sounds like that should get a 40, say, for the musicality trait. Yes, each band will use different music, but real musicians and adjudicators would know how to isolate the important, score-driving elements of the performance and compare them to the exemplars. This way, each band could be compared against a standard, published by BOA and understood by educators, rather than to other bands performing that day. If we want consistent scoring, we should at least try to show judges what a “40-musicality” looks like in a prototypical performance.

Such an approach may not be practical, though. It would require frequent “grounding” of judges in the exemplary performances for each score point. They would have to stop in between bands and check their score against the exemplary performances every time, which would take time away from the comments they provide on tape, and those comments are much more important to the education of marching band students. Getting the score right just isn’t important enough to justify the time a truly criterion-referenced system would require.

Those comments must be good, since schools continue to pay lots of money in order to get them. I’ve listened to a few tapes, and I think my evaluation would be that BOA adjudicators review bands better than Roger Ebert reviews movies. Their comments are extremely helpful and certainly promote growth in student achievement. Since these comments are so valuable, would it make sense to split the adjudicator role into two parts or two people? One could be called a “critic” or “reviewer” while the other could be called a “scorer” or “rater.” Then, even though the score isn’t important in promoting development, the outward sign at competitions (the naming of a champion, for instance) would have the appearance of due diligence and some semblance of fairness.

What else can be done? As much as possible, judges need to re-ground themselves on the criteria. This is a criterion-referenced system—or at least, in an ideal world, it would be—and judges need to keep referencing said criteria. They need to erase, as much as possible, all memories of other bands during the day. The magic secret, known to all mass-production test scorers, is to score it and forget it. They can’t let one band affect the score on the next, or it’s not truly a criterion-referenced system. Now, they’re human, so this isn’t entirely possible, but we should train judges in the ideal behavior of judging and hope for the best.

BOA could also use a completely random performance order for semifinals. This would produce the same odds that one school will experience the Marian Catholic effect as any other. In a truly criterion-referenced system, bands would not perform based on size, enrollment, general effect score during prelims, or any other measurable data, such as the time at which an application was received.
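A random draw of this kind is easy to sketch in Python. The function name `draw_performance_order`, the placeholder band names, and the seed value are all hypothetical. This is one possible way to do it, not a description of how BOA sets its order.

```python
import random


def draw_performance_order(bands, seed=None):
    """Return the bands in a shuffled order, leaving the input list untouched."""
    rng = random.Random(seed)  # a published seed would make the draw reproducible
    order = list(bands)
    rng.shuffle(order)         # in-place shuffle; no band is favored for any slot
    return order


semifinalists = [f"Band {i}" for i in range(1, 35)]  # placeholder names for 34 bands
print(draw_performance_order(semifinalists, seed=2009)[:5])
```

The point of the design is that nothing about a band (size, enrollment, prelim score, application date) enters the function, so every school faces the same chance of landing in the slot right after a powerhouse.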

Patterns definitely emerge in the performance order of bands in the semifinals at the Grand Nationals and at other contests. Consider general effect scores, for example. The first eight bands performing in the semifinals were first, second, third, fourth, 14th, sixth, seventh, and eighth from the bottom during prelims in their general effect score. This gives the appearance of non-random selection, despite Bands of America’s claim (see Grand Nationals program, p. 28) that “Semi-finalist directors draw for performance times after Friday night’s awards ceremony.”

The probability of a random draw coming out with the lowest-scoring band first, etc., for the first eight bands out of 34, where the fifth position doesn’t matter, is about 1 in 27 billion. We note that Bands of America never used the word “random” in their description of how order is determined. If they had, the press would have serious questions for the organization to which our schools, our corporations, and our citizens send $6.2 million annually.
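For anyone who wants to check that figure, here is the arithmetic as a short Python sketch. It assumes a uniformly random draw among 34 bands with no ties in the prelim general effect ranking, and it treats slots 1 through 4 and 6 through 8 as each requiring one specific band while slot 5 can hold anyone. The variable names are mine.

```python
import math

FIELD = 34        # semifinalist bands
FIXED_SLOTS = 7   # slots 1-4 and 6-8 must each hold one specific band; slot 5 is free

# Favorable orderings: the 7 fixed slots are forced, and the remaining
# 27 bands may fill the other 27 slots (including slot 5) in any order.
favorable = math.factorial(FIELD - FIXED_SLOTS)
total = math.factorial(FIELD)

one_in = total // favorable  # equals 34 * 33 * 32 * 31 * 30 * 29 * 28
print(f"about 1 in {one_in:,}")  # prints: about 1 in 27,113,264,640
```

Keep in mind this is the chance of that one exact pattern, specified in advance. A broader question, such as how often low-scoring bands cluster near the front in any form, would give less extreme odds.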