Linn and Everson on Testing, Standards and Accountability

Cross-posted from Education Week

Bob Linn is often described as the dean of American testing experts.  Howard Everson, long the Vice-President for Research at the College Board, is one of the nation’s top psychometricians.  I asked the two of them to draw on their nearly half a century of experience in this field to reflect on the current state of testing in light of the recent emphasis on student performance standards and the growing strength of the accountability movement.

Marc Tucker: Over the last 15 years, each of you in succession has chaired the New York State Technical Advisory Committee, responsible for advising the state on testing issues.  New York State has, since the Civil War, been home to the Regents Exams, an essay-based examination system originally designed for courses taken by elite students.  More recently, New York State has also been home to a more conventional multiple-choice system of accountability testing.  Both have been evolving.  Have they been sitting comfortably side by side?
Howard Everson: The TAC had been raising questions about the reliability of the scores from the Regents for a long time, but New York State officials did not really focus on those issues until the accountability movement placed much higher stakes on the results.  These issues were exacerbated in the case of the Regents by the fact that the teachers, not the state, had control of the content and scoring of the exams.  At the same time, the Regents were concerned that the multiple-choice tests were not rigorous enough, not measuring the higher cognitive demands associated with college readiness or the Common Core State Standards.
Bob Linn: It has always been my view that there are just some things you can’t do with multiple-choice test items.  Having kids construct their own answers has important advantages.  But there can be reliability and comparability issues when teachers score them.  I should note, though, that the College Board, to its credit, has worked hard, and I think successfully, to increase reliability in its Advanced Placement program by training the scorers properly.  That shows that it is possible to have a reliable course-based system of assessment with human beings scoring the exams.

MT: So how do the two of you think about reliability and validity now, in the light of this history and the current demands on testing?
HE: I would add to the tradeoffs among validity, reliability and cost, the stakes that get attached to the tests.  When Bob and I were on the TAC in the early 1990s, the stakes were shifting and changing and they are continuing to do so.  As the stakes rose, we were pushed to emphasize reliability more, validity less.

MT: Do these high stakes push you towards multiple-choice?
HE: Yes. It is not just the technology of the testing itself that changes over time; the most important change has been in the way tests are used here in the United States.

MT: If we look globally at the kinds of assessments used by top-performing countries, we see testing systems based largely on curriculum that are similar in design to the Regents and the College Board Advanced Placement tests—heavily essay-based with some multiple-choice items, but mostly essay.  Those countries are doing very well.  In their systems, there is not much test-based accountability for teachers but there is a great deal of test-based accountability for students.  Their exams probably don’t meet some of the psychometric standards for reliability we have in the United States. Where does that leave the two of you on where we should be going as a country?
HE: There has been a lot of litigation on testing in the United States and I think it has forced us to emphasize test score reliability over validity.
BL: One big difference between the United States and other countries is the prestige of teachers and the trust placed in them, which are very low in this country and tend to be quite high in the top performers.  This has led to the development of accountability systems that use external measures to see if schools and individual teachers are doing a good job. This has morphed into the next level: evaluating individual teachers.  Unless we can find a way to increase the prestige of teachers and public confidence in them, it will be hard to move too far away from using testing for these purposes.

MT: If we were to rely less on tests for reliability purposes, do you think we would be able to develop tests that will do a better job of measuring the kind of higher-order thinking skills that lie at the heart of the Common Core State Standards?
BL: Yes. I think that is true.

MT: What is your advice on the right balance between validity and reliability, especially if we want to embrace the goals implicit in the Common Core?
HE: I think the importance of reliability has been overblown.
BL: I agree it is less important than comparability and validity and fairness.  It would be highly desirable to go where the two state testing consortia want to go. They want to include, in addition to multiple-choice items, items where kids are required to do things, solve problems and show how they come up with solutions to the problems they are given.  But the realities of timing and cost are pushing them in a direction that will likely force them to come up short.

MT: Is this country getting ready to make a profound mistake?  We use grade-by-grade testing in grades 3-8, but no other country is doing it this way for accountability; instead they test 2 or 3 times in a student’s career.  If the United States did it that way, we could afford some of the best tests in the world without spending any more money.

BL: Raising the stakes for our test-based accountability systems so that there will be consequences for individual teachers will make matters even worse.  Cheating scandals will blossom.  I think this annual testing is unnecessary and is a big part of the problem.  What we should be doing is testing at two key points along the way in grades K-8, and then in high school using end-of-course tests.

HE: I am in the same place as Bob.  The multiple-choice paradigm first used in WWI and eventually used to satisfy the NCLB requirements has proven to be quite brittle, especially when applied in every grade 3-8 and used to make growth assumptions.  The quick and widespread adoption of multiple-choice testing was, in hindsight, a big mistake for this country, but now states will tell you it is all they can afford.