Sophie Quigley is a professor of computer science at Ryerson University who specializes in human-computer interaction. In 2009, while serving as the Ryerson Faculty Association’s grievance officer, she grieved the use of faculty course surveys in employment-related decisions such as promotion and tenure. Nine years later, in June 2018, the action finally resulted in an arbitration award declaring that faculty course surveys are flawed and discriminatory, and are not to be used to evaluate teaching effectiveness for promotion or tenure at Ryerson. They can, however, continue to be used to provide information about the student experience.
The history of the dispute over the use of student questionnaires, or “student evaluations of teaching” at Ryerson goes back many years — even before the grievance. What happened over the course of those years?
Even before faculty course surveys became a grievance, they were the subject of bargaining with the administration. Prior to 1994, the original student questionnaire was used mostly for formative purposes, although it could be made available to tenure committees upon request. In 1994 the questionnaire was formalized in the collective agreement, which also mandated the inclusion of the responses in annual reports. As a result, it began to be relied upon regularly during evaluations for tenure and promotion; but comparisons were not drawn between different faculty members, because there was no information to support that practice.
Administration of the questionnaires went from paper to online. What was the effect?
Putting the questionnaire online changed everything. When it was administered on paper, faculty members received results for each course section only. After the switch in 2007, not only were the questionnaires administered online, but results were posted online as well. The administration began producing online reports which included multiple summative tables of averages. Every faculty member now received a page summarizing their own averages: per section, per course, per person. There were also tables of averages for departments, faculties, and the university as a whole. Finally, the response rate dropped precipitously, which made the results even less valid. It continues to drop.
What was wrong with producing averages?
Two things. First, our collective agreement specifies that reports are to be framed as frequency distributions — in other words, tables counting the number of responses received in each answer category of each question. This is because frequency distributions give an accurate picture of the responses, unlike averages, which are not statistically sound in this situation and can be misleading. Second, the collective agreement specifies the material that can be used during tenure evaluations, and it does not include material that supports comparing an individual’s teaching performance against other colleagues’ performance.
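The point about averages hiding information can be made concrete with a small sketch. The response counts below are invented for illustration: two very different response patterns on a five-point scale produce identical averages, while the frequency distributions the agreement calls for make the difference obvious.

```python
from statistics import mean

# Hypothetical responses on a 1-5 Likert scale (1 = "strongly agree",
# 5 = "strongly disagree"). These counts are invented for illustration.
polarized = [1] * 10 + [5] * 10   # students split between the extremes
neutral   = [3] * 20              # every student answered the midpoint

def frequency_distribution(responses):
    """Count responses in each answer category, as the agreement requires."""
    return {category: responses.count(category) for category in range(1, 6)}

print(mean(polarized), mean(neutral))       # 3.0 3.0 -- identical averages
print(frequency_distribution(polarized))    # {1: 10, 2: 0, 3: 0, 4: 0, 5: 10}
print(frequency_distribution(neutral))      # {1: 0, 2: 0, 3: 20, 4: 0, 5: 0}
```

The average alone cannot distinguish a class that is sharply divided from one that is uniformly lukewarm; the frequency distribution preserves exactly that information.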
Mathematically then, these averages were incorrect?
These averages are statistically unsound, and that was the core of the grievance. The course survey used a five-point Likert scale — most questions had five possible answers ranging from “strongly agree” to “strongly disagree.” In the original paper-based administration of the questionnaire, students’ answers were encoded with numerical labels 1 to 5. However, as was explained by our expert witnesses during arbitration, these numbers are ordinal, i.e., they are labels that simply represent the order of the five answers; they do not represent an interval scale, which is an ordered list of numbers representing quantities. Averaging numbers on an interval scale provides useful information in some situations, but averaging ordinal numbers is simply nonsensical, and this is unfortunately what is often done with course surveys. The Ryerson survey was also averaging answers that had nothing to do with each other — for completely different courses, for example — with different material, different enrolments, and different students in different years and different programs. Finally, the averages were not even calculated correctly: in multi-section courses, the section averages were simply averaged into a single course average without weighting them by the number of respondents in each section. To this day we don’t even know how the departmental, faculty, and university-wide averages were calculated.
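The weighting error at the end is simple arithmetic, so a toy example may help. The section averages and respondent counts below are invented for illustration (and set aside that averaging ordinal labels is itself unsound): averaging section averages directly gives a large section and a tiny section equal influence, which can move the course number substantially.

```python
# Hypothetical multi-section course; all figures are invented for illustration.
sections = [
    {"avg": 2.0, "respondents": 90},   # large section
    {"avg": 4.0, "respondents": 10},   # small section
]

# Unweighted: average the section averages directly, as the reports did.
unweighted = sum(s["avg"] for s in sections) / len(sections)

# Weighted: a true per-respondent average, weighting by respondent counts.
weighted = (sum(s["avg"] * s["respondents"] for s in sections)
            / sum(s["respondents"] for s in sections))

print(unweighted)  # 3.0
print(weighted)    # 2.2
```

The unweighted figure of 3.0 bears little relation to the experience of the 100 actual respondents, 90 of whom sat in the section averaging 2.0.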
How were these averages being used by administrators?
As though they were semi-sacred, as extremely meaningful. They ascribed significant meaning to minute numerical variations, comparing personal averages with aggregate averages, and creating tenure requirements based on specific averages. We saw teaching assessments that included nonsense such as “in this department/faculty we expect members to have a 2.5 score on Question X and your score is 2.6, so you will need to improve your teaching to meet our tenure criteria.” We first thought that the administration would agree that this type of shallow analysis was dangerous and harmful. We tried to enlist their help in correcting the situation and educating assessors on the limitations of this instrument. However, it eventually became clear that not only did the administration not agree with us, but it endorsed this kind of analysis. So we filed our grievance.