One of my least favorite regularly deployed aphorisms is “Not everything that can be counted counts, and not everything that counts can be counted” (which you can find attributed to the usual dizzying array of people, but which probably should be credited to William Bruce Cameron). The problem isn’t so much with the content of the phrase, which contains at least as great an element of truth as anything else you could fit in a fortune cookie, as with its use. It’s typically deployed rhetorically, to assert the impossibility of measuring the true essence of whatever pet project the speaker is going on about at the time. And as an experimental physicist by training and temperament, I have a reflexive negative reaction to assertions of that sort.
I was reminded of this, and prompted to write about it today, thanks to this tweet from Kieran Healy:
As I noted when I quote-tweeted this, it’s a succinct statement of something that drives me nuts about many academic discussions of assessment. These discussions take lots of different forms: not just “Which of these research papers is the best?” or “Which of these departments is the best?” but also things like “Which of these faculty candidates should we hire?” and “Which of these students should be admitted to our college?”
These kinds of decisions are absolutely central to the functioning of academia in its modern form, and as Healy’s tweet notes, they are also closely related to core functions of scholarship in general. Looking at two presentations of ideas about whatever area and saying “This is crap, but that is great stuff” is what we do, in a very deep way. If it were truly impossible to judge “everything that counts,” it would be impossible to do much of anything that is recognized as the job of a faculty member at a college or university. In fact, the whole reason we have arguments about things like hiring and admissions is that people on both sides are making judgements about which candidates are most worthy, just coming to different conclusions.
This topic was already in the back of my mind thanks to a recent Substack post by Timothy Burke that I liked enough to do the fancy embed thing here:
This isn’t a vehemently anti-assessment post, but as the title suggests, it is appropriately conflicted about something that is genuinely a difficult subject. There are decisions associated with these questions that need to be made— who to hire, who to admit, how to allocate resources within the institution— and those decisions need to be made as well as possible. Asserting the impossibility of assessment and judgement doesn’t help get those essential jobs done.
One way out of this, which both Burke and Healy make use of, is to fall back on the idea of expert judgement as a special thing, deserving of extra deference. The idea being that the problem isn’t judgement per se, but simplistic or algorithmic judgement by outsiders. People within a particular area have, by dint of their professional training, expertise that allows them to make qualitative judgements of things that simply cannot be assessed with any accuracy by those without that training, and those expert judgements should be accepted without too much effort to quantify them.
That’s a line of argument that makes me a bit uncomfortable for a couple of reasons. The biggest is that physicist-by-training-and-temperament thing again: I’m not wild about the idea of expertise as some ineffable quality that can’t be explained or quantified. There’s also a problematic tendency for the range of who counts as an expert to expand and contract as needed to serve the interests of particular actors: most faculty take a very narrow view of who is qualified to judge the quality of their scholarly work, for example, but are far less inclined to show deference to the expertise of professional staff in admissions or student life.
(There’s also a bit of irony in the way that many of the problems that people prone to complaining about quantitative assessment feel strongly about are known to be problems precisely because of those quantitative measures. The biases against faculty from underrepresented groups in things like student evaluations of teaching are well known because they’ve been quantified and replicated many times. But that’s getting a bit far afield…)
There’s another issue here, though, which is that the argument about whether this or that aspect of academic work can be measured is often rhetorically conflated with the question of whether it’s worth the effort to measure in particular cases, or on particular time scales. Burke gets at this a little bit in his post, but it’s not spelled out quite as clearly as I’d like, so I’ll try to do that here.
My go-to example for this is the evaluation of teaching, which, contrary to a lot of claims, I think we know how to do reasonably well in a systematic way, because we do it for reappointment and tenure reviews at Union. The process is very similar for both our third-year review and the tenure review: in addition to student course evaluations, we collect narrative statements from the faculty member under evaluation, materials from all their courses, classroom observations from their chair or other colleagues, interviews with other faculty, and interviews with a (hopefully) representative sample of students from their courses. These materials are reviewed by other faculty within the department (who are at least believed to have relevant disciplinary expertise) and by at least one committee of faculty from outside the department.
That’s a very labor-intensive process, but I would be willing to defend it as providing an accurate assessment of who is a good teacher and who is not. It’s an appropriate process to use at those high-stakes points of a faculty career, where the decision is really “up or out.” When careers are at stake, it’s worth putting in the effort needed to do the evaluation right.
In parallel with the reappointment and tenure pipeline, we also have a “merit evaluation” system that helps determine annual salary increments. The process used there is much more cursory—pretty much just student evaluations and comments from the chair—and widely believed to be seriously flawed. I would not, however, argue that it should be replaced by the reappointment-and-tenure version of the evaluation, because the stakes just don’t justify the effort involved— nobody’s going to lose their job because of these, and the pool of money available for “merit” raises in any given year is not that large. I would argue (and have) that rather than moving to a more accurate but more labor-intensive system, we should just scrap the “merit” evaluations altogether.
With that in mind, I’m largely in agreement with Burke that a lot of academic assessment should go for a lighter touch and a longer time scale. That would also argue for lowering the stakes of as many decisions as possible.
That doesn’t mean, though, that I’m opposed to quantitative assessment per se; if somebody could identify a low-effort way to do it that was in reasonable accord with the existing labor-intensive methods, then I’d be open to it. We don’t have that now, but I’m not convinced that this is impossible even in principle, to use a physicist-y phrase. A lot of what we have now is, to my mind, best seen as part of a series of fumbling attempts to come up with accurate low-effort assessments of the core functions of academia, and I’m generally in favor of continued effort and experimentation in that direction.
So, yeah, I’m sure this will be regarded as a definitive but uncontroversial take, and we can all move on to more enjoyable topics. If you’d like to read those, here’s a button:
If you’d like to spread this take to others, here’s another:
And in the unlikely event that you don’t feel this has completely settled the matter, the comments will be open.
I think the one thing that really pushed its way through my head in writing that is that the expert judgement we have embedded in our self-assessment is less about our training or our particular scholarly expertise and more about what my colleagues Ken Sharpe and Barry Schwartz discussed in their book Practical Wisdom: we have the expertise of having done a particular kind of work for a long time, and a professional ethos that drives most of us to want to do it better. It's absolutely right to say that this kind of expertise has some significant limitations; every profession in the world defends itself in these terms when it's criticized (cue the Teddy Roosevelt quote: "It is not the critic who counts," etc.), but there are blind spots in every profession's work routines. (Among them: how the people who receive their services feel about the quality of what they received.) But surely experience should count for something, and yet a lot of assessment work brings experientially based self-assessment by faculty into view only in limited ways, and discounts its value for the most part.