Tag Archives: test scores

Value-added assessments

I posted about value-added assessments when the front-page story in the New York Times came out early this year. In recent weeks I’ve come across a couple of interesting commentaries on these scores.

At the Washington Post, Jay Mathews wrote a column titled “Devaluing value-added assessments.” I read it closely, but couldn’t work out what Mathews thinks is wrong with these scores. He begins by saying he will relate “the best argument against value-added I have seen in some time.”

Point #1:

“I have seen this sham firsthand over many years,” Wiggins writes. “Lots of so-called good N.J. and N.Y. suburban districts are truly awful when you look firsthand (as I have for three decades) at the pedagogy, assignments and local assessments; but those kids outscore the kids from Trenton and New York City, even though both city systems have a number of outstanding schools and teachers.”

I don’t get this–don’t value-added scores only measure changes within a single district? Aren’t we only using them to assess teachers within districts?

Point #2:

Also, Wiggins wrote, valid research on value-added exposes “hidden truths,” such as “it IS true that models accurately predict over a three-year period, performance at the extremes. Thus, the really effective teachers stay so and the really ineffective ones are really ineffective.”

I don’t understand this at all. What is the hidden truth here exactly? That teachers matter?

Point #3:

Schools with high test scores discover through value-added analysis that they need more than that. One outstanding prep school, Wiggins said, gave a professionally designed test of critical thinking to freshmen and seniors. There was no improvement. Similar results have come from colleges giving the Collegiate Learning Assessment of analytical skills, given to freshmen and seniors.

Huh? It sounds like Mathews is saying here that value-added scores help schools identify bigger problems. Isn’t that a good thing?

Point #4:

Our mistake was thinking this valuable long-term research tool would work as a one-year teacher rating system. “It becomes like a sick game of telephone: What starts out as a reasonable idea, when whispered down the line to people who don’t really get the details — or don’t want to get them — becomes an abomination,” Wiggins wrote. “By looking at individual teachers, over only one year (instead of the minimum three years as the psychometricians and VAM [value-added model] designers stress), we now demand more from the tests than can be obtained with sufficient precision.”

I’m not sure what to make of this. It sounds like the critique is that the VA measure uses only one year of change. I suppose that would be problematic if true, but I’m not sure it is true. Even if it is, the paper by Chetty et al. (subject of the NYT article linked above) offers evidence that VA measures are unbiased measures of teacher quality.
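The precision point in the quote (one year versus three) can be illustrated with a toy simulation. This is only a sketch under made-up assumptions: the function names, the noise level, and the spread of "true" teacher effects are all hypothetical, and nothing here comes from an actual value-added model. It simply shows that averaging several noisy yearly scores tracks an underlying effect more closely than a single year does.

```python
import math
import random
import statistics


def simulate_va_estimates(years, n_teachers=2000, true_sd=1.0, noise_sd=2.0, seed=0):
    """Simulate true teacher effects and noisy estimates averaged over `years`.

    All parameter values are illustrative assumptions, not taken from any
    real value-added model.
    """
    rng = random.Random(seed)
    # The true effects are drawn first, so they are identical for any `years`.
    true_effects = [rng.gauss(0.0, true_sd) for _ in range(n_teachers)]
    estimates = []
    for effect in true_effects:
        yearly_scores = [effect + rng.gauss(0.0, noise_sd) for _ in range(years)]
        estimates.append(statistics.fmean(yearly_scores))
    return true_effects, estimates


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def estimate_accuracy(years):
    """Correlation between the simulated true effects and their estimates."""
    true_effects, estimates = simulate_va_estimates(years)
    return pearson(true_effects, estimates)


if __name__ == "__main__":
    for years in (1, 3):
        print(f"{years} year(s): correlation with true effect = {estimate_accuracy(years):.2f}")
```

With these assumed values the expected correlation is sqrt(1 / (1 + 4 / years)), so roughly 0.45 for one year and 0.65 for three. Note that this is about precision only; it says nothing about whether the estimates are biased, which is the separate question the Chetty et al. paper addresses.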

A second commentary comes from Andrew Gelman’s blog. This is more of a technical discussion about whether VA measures make the right modeling assumptions.

Potpourri

  • “Top 0.1%, By Zip Code”
  • “Big Super PAC donors: Same old guns, just more money”
  • Causal effects of the Head Start program
  • New York Times mentions confidence intervals (in the context of value-added teacher ratings)
  • Social benefit of obesity: less crime?
  • “Spacing Children Farther Apart Benefits Older Siblings”

    Via the New York Times, this post at the Freakonomics blog caught my attention:

    A new study (PDF here) by University of Notre Dame economist Kasey Buckles and graduate student Elizabeth Munnich finds that siblings spaced more than two years apart have higher reading and math scores than children born closer together. The positive effects were seen only in older siblings, not in younger ones.

    The NYT post doesn’t address the selection issue–that those who choose to space their children apart may just be “better” parents than those who don’t. But the actual paper and the Freakonomics post do: the authors take advantage of the fact that some families wait between births due to factors beyond their control, namely miscarriages. This from the paper’s abstract:

    However, because we are concerned that spacing may be correlated with unobservable characteristics, we also use an instrumental variables strategy that exploits variation in spacing driven by miscarriages that occur between two live births. The IV results indicate that a one-year increase in spacing increases test scores for older siblings by about 0.17 standard deviations—an effect comparable to estimates of the effect of birth order. Especially close spacing (less than two years) decreases scores by 0.65 SD. These results are larger than the OLS estimates, suggesting that estimates that fail to account for the endogeneity of spacing may understate its benefits.
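To make the instrumental variables idea concrete, here is a minimal simulated sketch. Everything in it is an assumption for illustration: the data are synthetic, the variable names and coefficients are invented, and the only number borrowed from the abstract is the 0.17 planted as the "true" effect. It uses the simple Wald estimator for a binary instrument, which is not necessarily the specification the authors used. The direction of the confounding is also an arbitrary choice.

```python
import random
import statistics

TRUE_EFFECT = 0.17  # planted effect of one extra year of spacing (assumed)


def simulate_families(n=40_000, seed=1):
    """Return synthetic (instrument, spacing, score) rows.

    `u` is an unobserved family trait that shifts both spacing and scores,
    so a naive regression of score on spacing is biased. The instrument `z`
    shifts spacing but is independent of `u`.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.2 else 0           # gap-lengthening event
        u = rng.gauss(0.0, 1.0)                       # unobserved confounder
        spacing = 2.5 + 1.0 * z - 0.3 * u + rng.gauss(0.0, 0.5)
        score = TRUE_EFFECT * spacing + 0.3 * u + rng.gauss(0.0, 0.5)
        rows.append((z, spacing, score))
    return rows


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / var_x


def wald_iv(rows):
    """Wald estimator: difference in mean score over difference in mean spacing."""
    with_z = [(s, y) for z, s, y in rows if z == 1]
    without_z = [(s, y) for z, s, y in rows if z == 0]
    d_score = statistics.fmean([y for _, y in with_z]) - statistics.fmean([y for _, y in without_z])
    d_spacing = statistics.fmean([s for s, _ in with_z]) - statistics.fmean([s for s, _ in without_z])
    return d_score / d_spacing


if __name__ == "__main__":
    rows = simulate_families()
    spacing = [s for _, s, _ in rows]
    scores = [y for _, _, y in rows]
    print(f"naive OLS slope: {ols_slope(spacing, scores):.3f}")
    print(f"Wald IV estimate: {wald_iv(rows):.3f} (planted effect {TRUE_EFFECT})")
```

In this setup the naive slope is pulled toward zero by the confounder, while the IV estimate lands near the planted 0.17, because the instrument moves spacing without being related to the unobserved trait.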

    Interesting stuff. So my question is: do the authors think they are going to convince policy makers to come up with incentives to influence birth spacing?

    “Crystal clear” correlation between long school days and student outcomes is not causal

    Scott Lehigh in the Boston Globe: On charter time: A longer school day transforms low-income kids into high achievers.

    Drawing on recent state test score (MCAS) data, Lehigh notes that charter schools, which have longer school days than traditional public schools, are doing better in terms of scores. He then muses:

    At this point, several things should be crystal clear to everyone.

    First, more learning time can transform low-income kids into high achievers. Second, charters, which offer a significantly longer day for the same per pupil expense, are a bargain for taxpayers. Third, incremental change in the traditional schools will no longer suffice.

    But there are two big problems. One is that there are presumably several things about charter schools that distinguish them from traditional public schools, beyond the length of the school day. No attempt is made here to separate out the independent effect of school day length. More importantly, there is the huge selection effect: students who enroll in charter schools differ from those who don’t, probably in being more motivated to achieve.
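The selection problem can be shown with a small synthetic example. This is a sketch under invented assumptions, not a model of real charter schools: "motivation" is a hypothetical unobserved trait, the numbers are arbitrary, and the charter effect is deliberately set to zero. The point is only that a raw score gap can appear when students sort themselves into schools, and vanish when assignment is random.

```python
import random
import statistics


def simulate_students(assignment, n=20_000, seed=2):
    """Return (in_charter, score) rows for synthetic students.

    Scores depend only on an unobserved `motivation` trait plus noise;
    attending a charter has NO effect on scores in this simulation.
    `assignment` is "self_selected" (motivated students enroll more often)
    or "random" (a coin flip, as a hypothetical benchmark).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        motivation = rng.gauss(0.0, 1.0)
        if assignment == "self_selected":
            in_charter = motivation + rng.gauss(0.0, 1.0) > 0.5
        elif assignment == "random":
            in_charter = rng.random() < 0.5
        else:
            raise ValueError(f"unknown assignment: {assignment!r}")
        score = 50.0 + 10.0 * motivation + rng.gauss(0.0, 5.0)
        rows.append((in_charter, score))
    return rows


def score_gap(rows):
    """Mean charter score minus mean non-charter score."""
    charter = [score for in_charter, score in rows if in_charter]
    other = [score for in_charter, score in rows if not in_charter]
    return statistics.fmean(charter) - statistics.fmean(other)


if __name__ == "__main__":
    for assignment in ("self_selected", "random"):
        gap = score_gap(simulate_students(assignment))
        print(f"{assignment}: charter minus non-charter gap = {gap:.1f} points")
```

Under self-selection the gap is large even though the charter does nothing here; under random assignment it is close to zero. A raw comparison of MCAS scores cannot tell these two worlds apart.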

    OK, so I’m just shooting fish in a barrel, right? But it still seems problematic that this is what passes for informed debate when it comes to education policy. I don’t presume to know how education policy develops, but it seems safe to assume that the Boston Globe editorial pages are an important factor.

    NY Times article on single-sex education kerfuffle

    A group of education scholars and psychologists is crying “pseudoscience.”

    While some studies have found better outcomes from single-sex schools, the article said, the purported advantages disappear when outcomes are corrected for pre-existing differences. For example, Chicago’s Urban Prep Charter Academy for Young Men, a school whose high college admissions rates were praised this year by Secretary of Education Arne Duncan, was subsequently criticized by the scholar Diane Ravitch as having test results that were actually lower than average on basic skills.

    “This is very much a live issue, and I think it’s snowballing,” said Galen Sherwin, a staff lawyer for the Women’s Rights Project of the A.C.L.U., who is handling the Louisiana case. “I see news stories every single week about new proposals, usually based on the idea that boys and girls learn differently. Often it’s people who have attended training programs by Sax or Gurian, saying these programs will cater to boys’ and girls’ specific learning styles.”

    More here.

    NYT story on classroom tech and student outcomes

    The district leaders’ position is that technology has inspired students and helped them grow, but that there is no good way to quantify those achievements — putting them in a tough spot with voters deciding whether to bankroll this approach again.

    “My gut is telling me we’ve had growth,” said David K. Schauer, the superintendent here. “But we have to have some measure that is valid, and we don’t have that.”

    It gives him pause.

    “We’ve jumped on bandwagons for different eras without knowing fully what we’re doing. This might just be the new bandwagon,” he said. “I hope not.”

    I think that about sums up the story. But we can throw in this excerpt too, which seems to be the extent of the actual reportage on the (lack of) data:

    Many studies have found that technology has helped individual classrooms, schools or districts. For instance, researchers found that writing scores improved for eighth-graders in Maine after they were all issued laptops in 2002. The same researchers, from the University of Southern Maine, found that math performance picked up among seventh- and eighth-graders after teachers in the state were trained in using the laptops to teach.

    A question plaguing many education researchers is how to draw broader inferences from such case studies, which can have serious limitations. For instance, in the Maine math study, it is hard to separate the effect of the laptops from the effect of the teacher training.

    (Emphasis added.) Interesting to contrast the spending on technology with the cuts to teachers, and the role of accountability and evaluation in each case, which the article also does at times.

    Letters to the editor on test scores as measures of teacher quality

    The New York Times recently printed some letters to the editor on using test scores to evaluate teachers (in particular, in New York state). I’m just getting to these now–they were published in late May.

    I think the letters are interesting. The range of arguments and claims suggests that people really have no idea how they should be evaluating teachers.

    Letter writer number one says the problem is not the teachers, but the principals who need to become better at observing teacher quality:

    If principals cannot figure out whether teachers who work for them every day are effective, then we need different training for principals, not more tests for students.

    Letter writer number two thinks test scores are preferable to letting principals make judgments on their own, since this can lead to favoritism:

    Such tests enable administrators to directly measure results, without using more subjective criteria that often involve a preference for a particular teaching methodology, or, worse, the subjective impressions of administrators who may have an issue with a particular teacher. Such tests can also measure how much and how well a student has learned over a school year.

    Letter writer three thinks that the costs of testing, in the form of more class time spent on teaching to the test, outweigh the benefits.

    Letter writer four thinks there should be a form of peer evaluation, akin to what is done in universities:

    Perhaps we ought to consider as well another time-tested method of assessing teachers. Since medieval times, university professors have been essentially evaluated by peers. If peer evaluation is added to the principal’s assessment and student performance, a more accurate evaluation of teacher effectiveness would be achieved.

    Letter writer five thinks that students will try to sabotage teachers they don’t like by purposely flunking the tests, since the students don’t have any incentive to pass them.

    Judging teachers–and journalists–with data

    The debate over which is the dominant factor in educational outcomes–teachers or family background–continues to play out on the nation’s opinion pages. (See these earlier posts.) The latest piece is from Diane Ravitch in the May 31 New York Times. I find this piece interesting because it seems to suggest that the numbers can be sliced any way to support any point of view. Ravitch writes:

    To prove that poverty doesn’t matter, political leaders point to schools that have achieved stunning results in only a few years despite the poverty around them. But the accounts of miracle schools demand closer scrutiny. Usually, they are the result of statistical legerdemain.

    Ravitch then goes through a handful of claims of improved schools, followed by isolated cases of these claims being debunked. For example:

    In 2005, New York’s mayor, Michael R. Bloomberg, held a news conference at Public School 33 in the Bronx to celebrate an astonishing 49-point jump in the proportion of fourth grade students there who met state standards in reading. In 2004, only 34 percent reached proficiency, but in 2005, 83 percent did.

    It seemed too good to be true — and it was. A year later, the proportion of fourth-graders at P.S. 33 who passed the state reading test dropped by 41 points. By 2010, the passing rate was 37 percent, nearly the same as before 2005.

    What I think is ironic is that while Ravitch criticizes others for using “statistical legerdemain”–sleight of hand, for you non-English majors–she is actually engaging in it herself here. From these few cases of what she argues are inflated claims, she wants to draw a broader inference about educational outcomes:

    What is to be learned from these examples of inflated success? The news media and the public should respond with skepticism to any claims of miraculous transformation. The achievement gap between children from different income levels exists before children enter school.

    Sorry, it doesn’t work like that. If you want to tell me about the impact of pre-educational factors on educational outcomes, give me some evidence; don’t just give me a couple of cherry-picked cases (about a different independent variable, no less).

    (Actually, this isn’t that ironic. Proponents of quantitative and formal work in my field, political science, frequently make the case that statistics and mathematical analysis are just means of explicating what would otherwise be implicit assumptions. So the argument Ravitch puts forth just fits this pattern.)

    This piece does seem to clarify the debate, such as it is, though. On one side are those calling for data-driven approaches to evaluating teachers. On the other side are those who think evaluating teachers based on the available data is a bad idea, because (1) the available data, test scores, are not a good measure of the intangible quality of teacher performance, and (2) family background factors are more important than what happens in the classroom.

    I’m sympathetic to the teachers’ point of view–that is, the non-data-driven point of view, which I often see espoused by teachers in these pieces. But I’m puzzled as to how we are supposed to evaluate teachers, if not with the data we have. Is there any hope for a middle ground? New measures, perhaps?

    So what does this post have to do with journalists? I’m trying to kill two stories with one post. The second story is from today’s Globe, and is an unsigned editorial. It begins:

    SIGNIFICANTLY LONGER school days are a hallmark of urban schools where students excel despite their socioeconomic disadvantages. But what of the schools that offer more-of-the-same mediocrity during the extended day? State Education Commissioner Mitchell Chester delivered the tough, but fair answer: No more funding from the state.

    Here we have an example of the Globe’s editorial board calling for some “tough, but fair” accountability based on data. The editorial makes me wonder: how would the Globe like it if it were judged in a similar fashion, using hard data? Of course, the Globe is a private organization, so its finances are carefully monitored, and its performance is judged on them. But what about the more intangible outcomes that we think newspapers should be delivering? For example, an informed public debate? Or a vibrant civic life?

    Now think about whatever it is you do. Would you like to be evaluated in the same manner that teachers would be, under the data-driven approach?

    You could argue that teachers provide one of the most essential services in our economy, and so different standards apply. But I think journalism is pretty high up there as well. (Which makes you wonder why both are paid so little–but that’s another post.)

    What drives student performance (again)?

    I’ve written about this before, but Joe Nocera’s column the other day still bugged me.

    Going back to the famous Coleman report in the 1960s, social scientists have contended — and unquestionably proved — that students’ socioeconomic backgrounds vastly outweigh what goes on in the school as factors in determining how much they learn. Richard Rothstein of the Economic Policy Institute lists dozens of reasons why this is so, from the more frequent illness and stress poor students suffer, to the fact that they don’t hear the large vocabularies that middle-class children hear at home.

    Number one, that social scientists have contended something since the 1960s is not at all informative. Number two, “unquestionably proving” something is not what social scientists do. We actually don’t prove anything (when dealing with the empirical world, anyway); we just disprove alternative explanations. Number three, the sad thing is that–and this is just my impression–neither social scientists, educators, nor the general public really knows what makes students learn, or which factors are more important than others.

    Number four, what strikes me as a little odd about this debate is that when it comes to bad outcomes, reformers seem to want to emphasize the role of factors outside of the school, like home life. But what about good outcomes? If Nocera is right, doesn’t this mean we shouldn’t credit teachers with helping students learn? That maybe paying them so little is a good thing? I don’t agree with that, but it seems to be a corollary of his claim.

    And finally, more causal nihilism.

    What needs to be acknowledged, however, is that school reform won’t fix everything. Though some poor students will succeed, others will fail. Demonizing teachers for the failures of poor students, and pretending that reforming the schools is all that is needed, as the reformers tend to do, is both misguided and counterproductive.

    Can’t we all just get along? I’m all for admitting the limits of our knowledge, but I really don’t agree with the idea that we can’t get a better sense of what makes students learn–and that we should simultaneously assume we already have it figured out, “unquestionably.”