Here is a meter test I put together many years ago, but never got around to administering. I forgot all about it until recently when I was going through some old files and found the test and this blank answer sheet. Each musical example is in either duple or triple meter. Students are instructed to listen to each example once and then circle the correct word: Duple or Triple.
Here are the audio files of the test items.
Haydn – Symphony #100 – “Military”, 2nd mvmt
Johann Strauss – Tales from the Vienna Woods
The Beatles – “Norwegian Wood”
Bizet – Carmen Suite – Habanera
Bernstein – West Side Story – “I Feel Pretty”
Bach – Toccata In D, BWV 912
Kelly Clarkson – “Breakaway”
Beethoven – Symphony #9, 4th mvmt
The Marcels – “Blue Moon”
Haydn – Symphony #103, “Drumroll” – III. Menuet
Verdi – Rigoletto – “La Donna È Mobile”
Mozart – “Eine Kleine Nachtmusik” 1st mvmt
James Taylor – “Sweet Baby James”
Mahler – Symphony #1, 2nd mvmt
Handel – Parnasso In Festa – Chorus
Schubert – German Dance in B-flat, D. 783 #7
Bizet – Carmen – Overture
Verdi – Il Trovatore – Anvil Chorus
Handel – Water Music, Suite #2 – Minuet
Banjo solo – Cherokee Shuffle
Seal – “Kiss from a Rose”
Bach – Goldberg Variations – Variation 18 (woodwind transcription)
Saint-Saëns – Carnival of the Animals – Fossils
Bach – Mass In B minor – Gloria In Excelsis Deo
Prokofiev – “Classical” Symphony, 3rd mvmt
Billy Joel – “Piano Man”
Schubert – Moment Musical #3 in f minor
Rankin Family – “Mo Shall al dhekh”
Stravinsky – Suite for Chamber Orchestra, #2 – Waltz
The Del-Vikings – “Come Go with Me”
Haydn – Symphony #101 “The Clock”, 2nd mvmt
Handel – Brockes Passion – “Gott selbst der Brunnquell”
Saint-Saëns – Samson et Dalila – Bacchanale
Smetana – Bartered Bride – Act 1, Scene 1
Molly Mason & Jay Ungar – “The Mountain House”
Gotye – “Somebody That I Used to Know”
Bizet – Carmen Suite No. 1 – IV. Seguidilla
Simon & Garfunkel – “America”
Scarlatti – Sonata in G Major, K. 455 (synthesizer)
Johann Strauss – Blue Danube Waltz
I called the test the “original version” because I planned to write at least a few revisions, depending on the numbers it generated. My intent was to give it to my older students (4th grade and up), as soon as they were exposed to division patterns in duple and triple meters. But I got bogged down in other things, and this test stayed buried in my computer hard drive. During the past week, just on a whim, I administered the test to 75 students in 4th grade. The results were fascinating! But before I talk about the results, let me talk about the test for just a minute.
Because it’s a 40-item test, and each item has two options (duple or triple), the chance score is 20, the theoretical mean is 30, and the theoretical standard deviation is 3.33. If you look closely at the way I ordered the test items, you’ll see that no more than 2 duple items (or 2 triple items) appear in a row. I also rigged it so that if students tried to be sneaky by pattern-marking with alternate responses (duple-triple-duple-triple-duple, etc., or triple-duple-triple-duple-triple, etc.), they would get a chance score of 20.
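For the curious, here is a quick Python sketch of that arithmetic. It assumes that the theoretical mean sits halfway between the chance score and a perfect score, and that the theoretical standard deviation is one sixth of the distance from chance to perfect. Those assumptions reproduce the figures above, but they are assumptions. The answer key in the sketch is hypothetical (it is not the real key); it only illustrates how a strictly alternating response pattern can be made to land on the chance score.

```python
def theoretical_stats(n_items, n_options=2):
    """Return (chance score, theoretical mean, theoretical SD).

    Assumed conventions: mean = midpoint of chance and perfect score,
    SD = (perfect - chance) / 6.
    """
    chance = n_items / n_options
    mean = (chance + n_items) / 2
    sd = (n_items - chance) / 6
    return chance, mean, sd


def alternating_score(key, first):
    """Score a strictly alternating response pattern against an answer key."""
    flip = {"D": "T", "T": "D"}
    guess, score = first, 0
    for answer in key:
        score += guess == answer  # True counts as 1, False as 0
        guess = flip[guess]
    return score


chance, mean, sd = theoretical_stats(40)
print(chance, mean, round(sd, 2))  # 20.0 30.0 3.33

# Hypothetical 40-item key (NOT the real one): never more than 2 alike in a row.
hypothetical_key = list("DDTT" * 10)
print(alternating_score(hypothetical_key, "D"))  # 20
print(alternating_score(hypothetical_key, "T"))  # 20
```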
Here are some of the preliminary results: The raw scores range from 20 to 38. (Two students out of 77 scored below the chance score, but I removed these outliers from the results, leaving N=75.) The mean is 28.91 (slightly lower than the theoretical mean) and the standard deviation is 4.38 (greater than its theoretical counterpart), which basically means that the test is slightly more difficult than it ought to be, and there’s too much variability around the mean; in simple terms, too many kids scored too close to the chance score of 20.
The test could stand some tweaking, but still, it’s pretty damn fine as it is. The items, taken together, represent a great variety of styles and genres; and each item has its own shape, so to speak. Most of the musical excerpts end on a cadence; only rarely did I end an item in the middle of a phrase. Although most of the musical excerpts are instrumental, I was careful to avoid putting two instrumental items of the same timbre and time period together. To avoid placing two vocal items back-to-back, I deliberately interspersed the vocal items with the instrumental. Most of the vocal items ended up being in triple meter. I tried to find pop music examples in triple, because they’re so rare, and maybe I got carried away. This didn’t pose a problem. The item difficulty levels (which I’ll discuss in greater detail later) reveal that some vocal items were easy while others were difficult. In other words, students were not able to guess the meter by the genre.
The test has its problems. Not only is it too difficult for its intended population, but it’s too long. With 40 items, it clocks in at exactly 30 minutes. (I had the kids finish the test in two sessions, so that fatigue would not set in.)
So, how should I go about shortening the test? One way is to shorten the length of each item. You’ll notice that the items are between 34 and 45 seconds long; in the next version, no test item will exceed 30 seconds. I suppose I could shorten the test by making the items only 15 or 20 seconds long, but I don’t want to do that (even though I observed most students circle their answers to most test items after roughly 15 seconds). Students may be able to audiate the meter of a 15-second segment of music. But I want the test to be a rich, aesthetic experience for them, and not just a test of audiation skill. In short, I refuse to sacrifice my students’ aesthetic experience for brevity. Thirty seconds is as brief as I’m willing to go.
Another way to shorten the test is simply to remove test items that are too easy, or too difficult, or those that simply don’t reveal much difference in knowledge between high scorers and low scorers.
In columns B and C, you can see the item difficulty levels (Df) and the item discrimination values (Ds).
First, I scored the papers; then I ordered them from lowest to highest; then I separated the lowest 27 percent and the highest 27 percent, a practice recommended by Robert Ebel in his book Measuring Educational Achievement. Twenty-seven percent of 75 is 20.25, which rounds to 20. In other words, I took 20 of the lowest papers and 20 of the highest papers, 40 papers in all to “play with”. Then I tallied all the correct answers from each paper.
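The sorting-and-splitting step can be sketched in a few lines of Python. The function name and the list of totals are made up for illustration (they are not the real scores), and rounding the group size to the nearest whole paper is an assumption.

```python
def split_extremes(totals, fraction=0.27):
    """Return (lowest group, highest group) of total scores."""
    ordered = sorted(totals)
    k = round(len(ordered) * fraction)  # 75 * 0.27 = 20.25 -> 20
    if k == 0:
        return [], []
    return ordered[:k], ordered[-k:]


# Made-up totals for 75 papers, spread between 20 and 38 (not the real data).
totals = [20 + (i % 19) for i in range(75)]
low, high = split_extremes(totals)
print(len(low), len(high))  # 20 20
```

The 35 papers in the middle are simply left out of the item analysis.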
The procedure for calculating item difficulty is as follows:
Add the number of correct responses from the lowest 27% to the number of correct responses from the highest 27%. Then divide that sum by the total number of test papers you’re analyzing.
The procedure for calculating item discrimination values is as follows:
Subtract the number of low scorer correct responses from the number of high scorer correct responses. Then divide that difference by half the number of test papers you’re analyzing. Keep in mind that you’re still using only the top and bottom 27% of the total number of test papers.
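Both procedures fit in a minimal Python sketch. The function and parameter names are my own, and the counts in the example are hypothetical, chosen only to show the arithmetic.

```python
def item_difficulty(low_correct, high_correct, n_papers):
    """Proportion of the analyzed papers that answered the item correctly."""
    return (low_correct + high_correct) / n_papers


def item_discrimination(low_correct, high_correct, n_papers):
    """High-group minus low-group correct, divided by half the analyzed papers."""
    return (high_correct - low_correct) / (n_papers / 2)


# Hypothetical item: 8 of 20 low scorers and 16 of 20 high scorers got it right.
print(item_difficulty(8, 16, 40))      # 0.6
print(item_discrimination(8, 16, 40))  # 0.4
```

Here n_papers is the number of papers in the two extreme groups combined (40 in this case), not the full set of 75.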
(Some readers may catch that I use formulas that are slightly different from those recommended by Gordon and Walters. They don’t follow the 27% rule.)
Let me use Item #1, Haydn’s Symphony #100, as an example. From among the low scoring group of 20, 16 kids got Item #1 correct; from among the high scoring group of 20, 18 kids got Item #1 correct. I added 16 and 18, got 34, then divided 34 by 40, the total number of test papers under investigation. The result was .85, meaning that Item #1 is a fairly easy item.
I then subtracted the low scorers (16) from the high scorers (18), and then divided that number (2) by half the number of test papers I’m using (20). The result is .10, which means that Item #1, though it’s a positively discriminating item, doesn’t show a clear distinction between the high and low achievers.
Item discrimination is a bit complex. If many high-scoring kids get a test item wrong, that’s not a deal breaker; and if many low-scoring kids get a test item right, that item might still be a keeper. But… if most of the high scoring kids get an item wrong, while most of the low scoring kids get that same item right, that spells trouble: The test item is not doing its job, which is to differentiate between high and low achievers.
If test items fall in the range of difficulty from 60 percent to 90 percent, I will retain them. If they fall outside that range, I discard them (or at least most of them). And I want only positively discriminating items with a value of at least .20, with most of the items greater than .40. If an item has a discrimination value lower than .2, I will, with few exceptions, discard it.
In short, I want a range of item difficulty from .60 to .90, with most items hovering around .75. And… I want most items to have discrimination values of .2 or greater. Hooray! That’s basically what I got!
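Those cutoffs can be written as a simple filter, sketched here in Python. It applies the thresholds strictly, with none of the judgment-call exceptions I allow myself, and the item values in the example are hypothetical, not the actual results.

```python
def keep_item(difficulty, discrimination,
              min_df=0.60, max_df=0.90, min_ds=0.20):
    """True if an item meets the stated difficulty and discrimination cutoffs."""
    return min_df <= difficulty <= max_df and discrimination >= min_ds


# Hypothetical (difficulty, discrimination) pairs, not the actual results.
examples = {
    "item A": (0.75, 0.45),   # mid difficulty, discriminates well
    "item B": (0.55, 0.50),   # too difficult
    "item C": (0.70, -0.10),  # negatively discriminating
}
for name, (df, ds) in examples.items():
    print(name, keep_item(df, ds))
```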
What’s next for this test? A revised version with kick-ass items that retain their difficulty levels and high discrimination values. And then, finally, I will calculate its reliability. Will improved test items result in high test reliability? Will shortening the test impact negatively on reliability? We’ll all have to wait until the next school year to find out!
To finish this long but geekily fascinating blogpost, I’ll discuss a few test items that fell in the extremes, why I’m chucking most of them, and why I’m keeping a few for the next revision.
Bach’s Goldberg Variation 18 and Schubert’s Moment Musical are obvious items to discard (with zero discrimination values, and difficulty levels that make the test tougher than it ought to be). Same with Mahler’s First Symphony and the Kelly Clarkson song (even though the kids enjoyed it). Out they go. The Stravinsky Waltz fascinates me: it falls in the middle range of difficulty; but, as a negatively discriminating item, it works against the overall test. More low scorers got it right than high scorers! With its oom pah pah underlying beat, it makes sense that low scorers got it; but how could so many high scorers miss it?! We’ll never know. And Haydn’s “Military” and “Clock” Symphonies? What could scream duple more than a clock or a military march?! The low scorers tended to get them. But high scorers? Hello? Where did you disappear to? At any rate, out go those items. (But not out completely. Because they’re easy items, I’ll use Simon and Garfunkel’s “America,” and Haydn’s “Military” Symphony as pre-test examples in the next revised version of the test.)
And then there’s the Bach Gloria from the Mass in B minor. I’m keeping it. Yes, its discrimination value did not meet my criteria; but it’s a fairly difficult item, and items on the extreme ends of difficulty sometimes have low discrimination values. It’s a good item. I’m keeping it. End of story. Same with the Del-Vikings’ “Come Go with Me.” It’s an easy item in a test that could benefit from a few more easy items. It’s staying. Even though I love the Rankin Family’s “Mo Shall al dhekh,” I can’t really justify keeping it, except that if I cut it, the test will be unbalanced. I’m getting rid of 8 items: 4 duple and 4 triple. The Rankin Family song stays, at least until I can replace it with a more highly discriminating item in triple meter.
Now let’s look at a few items that are worth their weight in gold. Mozart’s Eine Kleine Nachtmusik. Who could have predicted that this piece of Classical Top 40 would make my top 40? It’s a really difficult item that discriminates so well between those who “get” duple meter and those who don’t. And the Scarlatti Sonata! And the Banjo solo! I never would have foreseen their high discrimination values. And hats off to that Belgian-Australian singer-songwriter Gotye for his song “Somebody That I Used to Know.” I couldn’t have asked for a better test item. And it puts a smile on the kids’ faces when they hear it.