Setting NAEP Performance Standards: How We Got Here


Cross-posted at Education Week.

Last week, I wrote a blog commenting on the draft policy on performance standards recently issued by the National Assessment Governing Board.  In it, I called for performance standards based not on people’s opinions about what constitutes basic, proficient and advanced performance at three grade levels but on what the evidence shows it takes to succeed in the first-year program of a typical community college.  My grounds were that, for at least half of our high school graduates, our community colleges are the gateway both to careers requiring occupational certificates and to two-year and four-year college degrees.

Among the responses I got was one from Jim Pellegrino, who sent me two papers he thought might interest me.  One was the chapter on setting performance standards in a book that he, Lee Jones and Karen Mitchell had done for the National Academies in 1999 titled Grading the Nation’s Report Card: Evaluating NAEP and Transforming the Assessment of Educational Progress.  The other was a paper by Albert Beaton, Robert Linn and George Bohrnstedt written for the National Center for Education Statistics in 2012 titled Alternative Approaches to Setting Performance Standards for the National Assessment of Educational Progress (NAEP).  They make for very interesting reading for those of us who care about the usefulness of The Nation’s Report Card.

Reports from the National Academies are usually written in the most measured scholarly language imaginable.  Not this one.  The smoke rises from its pages.  NAEP’s standard-setting process, it says, is “fundamentally flawed.”

Reviewing critiques of the NAEP standard-setting process offered through the 1990s, the chapter from the National Academies book quotes analysts who described the process of making judgments as a “nearly impossible task” for the raters, pointed out that the process produced different cut scores for different kinds of test items (e.g., open-ended vs. multiple choice) and said the cut scores had been set at levels that were simply not credible when compared to evidence from other well-regarded assessments.

All these issues came to a head with the 1996 science assessment.  The results showed that, at all three grade levels, very low percentages of students scored proficient, and, at the high school level, hardly any students had made it into the advanced level.  The obvious conclusion was that students who had earned good grades on the Advanced Placement tests in science were not considered by NAEP to have achieved at advanced levels.  One would also have to conclude that students who had done very well on the Trends in International Mathematics and Science Study (TIMSS) were not doing advanced work either.

These results were not all that unusual.  In other cases, too, the NAEP findings on the performance levels of American students seemed to be way off in both directions.  The standards were just not credible.

So the Board adjusted the cut scores to make them more credible.  But then the critics noticed that the new cut scores did not match up with the performance descriptions for the standards.  What was defined as proficient work in the descriptions was not what was tested by the items used in the proficient range.  So the Board took out the definitions!  Well, that was one way to deal with the discrepancy.  Then one could say a student was proficient if he or she scored in a certain score range, but one could no longer say what that meant in terms of what the student could do.

The Board empaneled another group to write new descriptions of what the performance levels meant.  This time the descriptions were based not on what a student should know to be proficient, but on what students scoring in each range currently know and can do.  “…[I]nstead of reporting achievement results relative to an established standard of performance… the science report presented results that were based on NAGB’s a priori judgment as to what constituted reasonable percentages of students at the three achievement levels.”

The problem with that, of course, is that the public might reasonably think that a student who was rated proficient on a subject at a certain grade level by NAEP was able to do what a student needed to do to be successful according to expert judgment, but it did not mean any such thing.

What had really happened was that a complex technical process that was supposed to produce findings about student achievement against common-sense standards had failed badly.  The process had produced findings showing that, in some cases, the standards were much too high by any reasonable measure and in other cases much too low.  In the end, the NAEP Board did what such bodies had always done before in such cases.  It adjusted the results to produce a politically palatable result without a solid rationale for its decision.  It remained very unclear what the performance standards were or what they ought to be.

The problem, as I pointed out in my last blog, is that the nation was much more focused on finding accurate, unfudgeable measures of student performance with which to measure the performance of state education systems, districts and schools than ever before.  The issue of what the performance standards meant and how they should be developed would not go away.

You might reasonably assume that a slashing attack like this from such a distinguished group of critics would have led the NAEP Governing Board to respond to the critique with a fix for its standard-setting problems.  But that did not happen.  Years later, the process for setting the standards had not changed substantially, so another, no-less-distinguished band of scholars took up where Pellegrino and his colleagues had left off.  Beaton, Linn and Bohrnstedt were members of the NAEP Validity Studies Panel, now chaired by George Bohrnstedt himself.  Their paper was written as part of that program of studies.

In it the authors explore three possible alternatives to the process for setting NAEP achievement levels I described above.

The first alternative would be to make the achievement cut points predictive.  The cut point for the end of elementary school assessment would predict the likelihood of success in middle school, the cut point for the end of middle school assessment would predict the likelihood of student success in high school, and the purpose of the end of high school assessment would be to predict the likelihood of success in college and career.

The second alternative would be to “benchmark the achievement levels against international standards.”

Their third alternative was to use percentile rankings to set base-year norms against which progress could be measured in succeeding years.  The authors acknowledge that hybrids of these approaches could be developed, too.
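To make the third alternative concrete, here is a minimal sketch in Python of how base-year norms might work.  It is an illustration under stated assumptions, not the procedure proposed in the paper: the function names, the choice of the 25th, 50th and 75th percentiles, the nearest-rank percentile rule and the scores themselves are all invented for the example.

```python
def base_year_cuts(base_scores, percentiles=(25, 50, 75)):
    """Return {percentile: score} for the base-year distribution.

    Uses the nearest-rank definition: the cut for percentile p is the
    smallest observed score with at least p percent of scores at or below it.
    The percentiles chosen here are hypothetical, not NAEP policy.
    """
    ordered = sorted(base_scores)
    n = len(ordered)
    cuts = {}
    for p in percentiles:
        rank = max(1, -(-p * n // 100))  # integer ceiling of p * n / 100
        cuts[p] = ordered[rank - 1]
    return cuts


def share_at_or_above(scores, cut):
    """Fraction of students scoring at or above a fixed cut score."""
    return sum(score >= cut for score in scores) / len(scores)


# Invented scores for illustration only; these are not NAEP data.
base_year = list(range(1, 101))           # 100 students scoring 1..100
later_year = [s + 5 for s in base_year]   # every score 5 points higher

cuts = base_year_cuts(base_year)          # cuts are frozen in the base year
print(cuts)
print(share_at_or_above(base_year, cuts[75]))
print(share_at_or_above(later_year, cuts[75]))
```

With these made-up numbers, 26 percent of base-year students score at or above the 75th-percentile cut, and 31 percent do so in the later year.  The cut scores themselves never move, so any change in the share is a measure of progress against the base year.  What such norms cannot say is what a student at a given percentile actually knows and can do.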

I was astonished when I read this paper.  Taken together, its proposals mirror the plan I described, in more detail, in last week’s blog.  But those ideas are not new.  I first proposed them years ago when my organization created a program called Excellence for All to put them to the test in the field in several states including Kentucky, Arizona and Mississippi.  The assessments we used were not the NAEP assessments, which were not designed to be used as census assessments of all the students in a school, but the International General Certificate of Secondary Education exams offered by Cambridge Assessment International Education of Cambridge, England.

The idea was to set a high standard for all high school students based on what it would take to succeed in the first year of a typical community college program and design the program of the high school so that most students would reach that standard by the end of grade 10 and almost all would get there by the time they graduated high school.  Students who reached the standard by the end of grade 10 would be able to enroll in a demanding upper division high school program like International Baccalaureate, a whole program of AP courses or the Cambridge Diploma program, all of which are designed to qualify students who get good grades in them for admission to the world’s most selective colleges and universities.  Or they could take a full program of community college courses and wind up either with a strong vocational credential or two years of college credit, ready to transfer at the end of high school straight into the junior year of a state college or university.

We needed a team of top education researchers to help us set the right pass points on the Cambridge exams to make this program work.  We asked Jim Pellegrino and Howard Everson to chair the Technical Advisory Committee.  The other members were Catherine Snow, Phil Daro, Bob Linn, Richard Duran, Ed Haertel, Dylan Wiliam, Joan Herman and Lloyd Bond.

Beaton, Linn and Bohrnstedt recommended that NAEP do an empirical study to benchmark college and career readiness.  We did that, with a research plan approved by the members of this Technical Advisory Committee (TAC) and drawing on the services of several of its members.  Beaton, Linn and Bohrnstedt recommended that the NAEP high school performance standards be set to predict the likelihood of student success against the empirically determined college and career benchmark.  We did that, using the Cambridge exams rather than the NAEP assessments, also under the supervision of the members of our TAC and using their services.  Beaton, Linn and Bohrnstedt recommended that NAEP benchmark international student performance standards and design NAEP performance standards to predict student success on those benchmarks, too.  We did that, too, using the Cambridge assessments instead of the NAEP assessments.

The only difference between what we did and what was recommended in the paper was that we substituted Cambridge examinations for NAEP assessments.  There is no reason why NAEP could not replicate what we did, substituting NAEP assessments for Cambridge assessments.

Not only that, but, because we have already done the work, we know what an empirical study of college and career readiness, defined as readiness for success in the first year of a typical community college program, requires.  We also know that this is pretty much the same performance standard that is met by the typical student who is entering gymnasium in Europe or beginning to take “A”-level exams in England.

The point is that a very good model exists and has been tested for producing performance standards for NAEP that would address and resolve the problems in setting performance levels that have dogged NAEP for years.
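The core idea of a predictive standard can be sketched in a few lines of Python.  This is a deliberately simplified illustration, not the method used in our study or the one described by Beaton, Linn and Bohrnstedt: the function name, the 80 percent target, the minimum group size and the records are all assumptions made up for the example, and a real study would need far larger samples and a proper statistical model.

```python
def lowest_predictive_cut(records, target_rate=0.8, min_students=5):
    """Find the lowest exam score that predicts later success.

    records: list of (exam_score, succeeded) pairs, where succeeded is True
    if the student went on to succeed at the next stage (for example, the
    first year of community college). Returns the lowest score such that
    the students at or above it succeeded at a rate of at least target_rate,
    or None if no cut with at least min_students students qualifies.
    """
    candidates = sorted({score for score, _ in records})
    for cut in candidates:
        outcomes = [ok for score, ok in records if score >= cut]
        if len(outcomes) < min_students:
            break  # too few students left above the cut to trust the rate
        if sum(outcomes) / len(outcomes) >= target_rate:
            return cut
    return None


# Invented records for illustration only; not data from any real study.
records = [
    (1, False), (2, False), (3, False), (4, True), (5, False),
    (6, True), (7, True), (8, True), (9, True), (10, True),
]
print(lowest_predictive_cut(records, target_rate=0.8, min_students=3))
```

With these made-up records, the function returns a cut of 4: six of the seven students scoring 4 or higher went on to succeed.  The point of the sketch is that the cut score comes from evidence about what happened to students afterward, not from a panel’s opinion about what proficient ought to mean.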

This is no esoteric matter.  For all the reasons I advanced in last week’s blog, NAEP is this country’s last redoubt for honest comparisons of the performance of a state’s education system to that of other states.  It is certainly true that performance could just be reported as a number on a scale.  But what does that number mean?  What does it say about what the students know and can do?  About whether they are ready for the next stage of their education or to begin a rewarding career?  How their performance compares to the performance of students of the same age in other countries?  If you care about the answers to these questions, you should care about the way NAEP sets its performance standards.

The changes now proposed by NAEP to the standard-setting process do not address these issues.  If you think the issues I have raised are important, write to the NAEP Board with your comments.