Assessment Needs to Be Worth the Hassle
Another day, another revise-and-extend job on a Twitter comment...
For whatever reason, The Chronicle of Higher Education’s newsletter this week looked at student course evaluations and found them wanting in the usual ways. There isn’t any new research presented in this— the most recent link in the piece is from July— but I guess it’s being posted now because the end of the Fall semester will have the topic on faculty minds.
This was shared around a bunch, with the usual range of comments, from the very reasonable point that these shouldn't be the sole basis for evaluating faculty to the very unreasonable claim that the true nature of education is ineffable, and thus any attempt to measure it is foolhardy if not outright evil.
My contribution to the social-media traffic on this was a tweet in response to Tyler Austin Harper, which I’ll screenshot because nobody clicks links any more:
To expand a bit on my point here, I think a lot of the claims made in this debate are over-broad, particularly when it comes to the difficulty of measuring effective teaching. While I will agree that it’s not a simple problem, I think we do have a whole suite of ways to measure whether particular educational methods work as advertised, and even whether a particular professor is doing their job well. They’re just an enormous hassle to implement.
The funny part here is that a lot of the studies cited when people talk about the badness of student course evaluations are doing exactly the thing that people citing the studies say is impossible. They’re doing actual measurements of how much students learned in particular courses, and correlating that with the student evaluation numbers and grades to show that, if anything, high scores from student evaluations are associated with less learning. Which is a strong argument precisely because they’ve got more than just vibes to back it up.
These are tricky studies to do, though, requiring either some kind of standardized testing of student skills and knowledge independent of a particular class, or some longitudinal evaluation of performance in later courses. And a boatload of statistical analysis. It’s not anything that’s going to be implemented for routine evaluation of classes or faculty.
Or, for another example, you can look at what we do in cases where the stakes are high: our reappointment and tenure evaluation processes. The system at Union has tenure-track faculty go through two comprehensive evaluations of their teaching, research, and service activities: a reappointment review in the third year, and then the tenure review in the sixth year. Both of these are high-stakes, up-or-out processes: faculty who don't pass the review are given one year to wrap things up, and then they have to leave the college.
The teaching review is similar for the two processes: the faculty candidate has to provide a comprehensive portfolio of teaching materials— syllabi, class notes, assignments, exams— and a statement explaining their approach and commenting on their experiences. Other faculty in the department will observe their classes, and the review committee will interview all the other faculty in the department and at least 20 students from past classes to get their impressions.
That’s a ton of stuff, and having gone through both reviews as a candidate and having been on more review committees than I care to think about, I have a good deal of confidence that this is an effective and reliable way to assess whether someone is doing a good job of teaching. It’s also a gigantic pain in the ass, which is why we only do it for reappointment and tenure reviews, when somebody’s job is literally hanging in the balance.
This is obviously not the kind of thing that can be done at scale, for all faculty on a routine basis. Honestly, it stretches our capabilities a bit to do it even just for reappointment and tenure reviews— nobody wants to be on those committees if they can possibly get out of it.
If you want to do routine evaluations at scale, you need something that doesn’t require anywhere near the amount of effort we put into the high-stakes reviews. The problem is that putting in less effort will get you a less reliable measure of the outcomes you really care about. But, more or less by definition, the stakes of these routine reviews are also low— nobody’s getting fired if they don’t excel— so low precision should be acceptable.
The problem with the common (mis)use of student course evaluations is that all too often they’re used in a way that attaches high stakes to a low-precision instrument. I don’t think they’re completely useless— “Did students enjoy their experience in this class?” is, in fact, something we should care about— but they should never be the only thing considered, and shouldn’t have significant stakes attached. And they shouldn’t be used to compare faculty to one another, only as a kind of longitudinal check on an individual’s performance over time. If there’s a sudden drop (or a rapid rise) in the evaluations for the same professor teaching the same course, that’s probably worth looking into with some higher-precision, higher-effort techniques.
If you’re talking about tying significant raises to faculty evaluations, though, student course evaluations don’t even begin to cut it. Especially if there’s any zero-sum competitive aspect to the evaluation, where one person getting a raise means someone else doesn’t. There are too many well-documented problematic biases affecting student course evaluations for them to be used in that manner.
The key is to match the level of effort put into a measurement to the consequences associated with the outcome. Student course evaluations are super-low-effort— you send students a web link or pass out a form, and that’s it— which means they should only be used in a super-low-stakes way. A sudden drop in scores for a given class by a given professor probably calls for some reflection on their part, and maybe a check-in from the chair. A significant and sustained drop probably calls for some class observations and the like, to get a more complete understanding of what’s going on. But nobody should be losing out on money based only on evals, let alone at risk of losing a job.
But I would also note that this goes both ways. It’s wildly inappropriate to use a low-effort, low-precision tool for high-stakes evaluation, but it’s not a whole lot better to use a high-effort tool to do low-stakes evaluation. You end up wasting the time and energy of faculty (especially department chairs and the folks with an overdeveloped sense of responsibility who end up on all the committees) on measurements that are literally worth nothing. If you want there to be stakes, you need to put in effort, but if you want people to put in effort, there damn well better be stakes.
I’m all in favor of removing student course evaluations from high-stakes processes, and replacing them with more precise assessment methods, even if it requires more effort to do so. I’m strongly opposed, though, to moving to high-effort methods for routine evaluations with low or no stakes, but I’m more than a little afraid that the faculty tendency to prefer elaborate structure is going to push us in that direction.
That may or may not clarify what I was going for with the tweet. If you want to see whether I have to revise and extend further, here’s a button:
And if you want a low-effort way to share your assessment of this, the comments will be open:
One of the problems is faculty complicity in the thing we claim to hate. Everyone knows all of the problems with teaching evals, but we are all willing to use them as at least a crude tool. I think of teaching evals as being a bit akin to the SATs. I know what the margins mean -- a kid who gets a 1600 is no dummy, and a kid who gets a 400 is likely to struggle in college -- but ask me to differentiate in a meaningful way between a 1050 and a 1250, and I really don't think anyone could say with a straight face that they know what the difference means.
Same with teaching evals. On a traditional 1-5 scale, 5 being good, if someone is averaging 4.7 all the time, we basically say they are fine -- great, even! -- unless there is something else that stands out. If someone is consistently averaging 2.3, we have a problem. But what does a 3.3 mean -- especially if conscientious students figure "3 means average, and this person was fine, so give 'em a 3"? Making matters worse, on some scales 3 is labeled not as the midway point but as neutral, which lowers the average of good people even when the student isn't making a value judgment, and likewise raises the average of poor performers.
I generally agree with this take, and especially with the point that high-effort data collection needs to be justified by high-stakes outcomes -- a lesson that could be applied far beyond academia. (Ask your doctor how much of their time they spend on insurance paperwork.)
That said, I do have a quibble with the implication that high-quality data collection must be onerous. It usually *is* onerous, but a lot of the time it could be made far less painful than it is. Again looking at the doctor's office: How often have you been handed a packet of forms where you fill in the exact same data repeatedly in slightly different formulations? That isn't a necessary part of the process. It's just wasted effort created by our fragmented health care system.
Quality data collection is always going to require more effort at some point. But if more of that effort were invested in designing an efficient process (which includes integrating it with existing processes), a lot less would be required to carry it out.