Tackling Bias in Automatic Speech Recognition - Two Examples From Our Ongoing Work
Rosy Southwell is a postdoctoral research scientist at CU Boulder who holds a PhD in Cognitive Neuroscience from University College London, UK and an MS in Natural Sciences from the University of Cambridge, UK. As part of iSAT, Rosy works on automatic speech recognition and processing to help extract as much information as possible from noisy audio recorded in the classroom.
Dr. Wayne Ward is a Research Professor at CU Boulder whose research involves applying supervised machine learning to the tasks of automatic speech recognition, dialog modeling, and extracting semantic representations from speech and text. His recent focus has been on applying these technologies to question answering and virtual tutoring systems.
AI systems that are designed to offer real-time classroom support need to be able to understand what students are saying—and do so with high accuracy. This requires Automatic Speech Recognition (ASR), the process of automatically converting spoken language into text. The text can then be used by an AI to understand how students are working together.
A key consideration when developing an AI system is how it is trained and the data it learns from. In the context of speech recognition, the AI is trained on a large collection of audio recordings from many different speakers. These systems have become much more accurate in recent years, especially for adults from particular demographic groups (white, native English speakers with US accents)—but that profile does not reflect the diversity of speakers in the world.
The question is: how will an AI perform in a classroom setting where it is mostly children and teenagers who are talking? They may come from diverse backgrounds, speak in a variety of accents, and use Gen Z slang. In our work, we have found this domain to be significantly challenging for existing speech recognition systems—in part because children's speech is still rarely used to train these systems. Let's discuss two variables in our data where ASR shows its weaknesses: age and race.
First, let's look at how we can adapt ASR to work better for students of all ages. We have a lot of training data from adults and elementary school students, and a small amount of test data from 9th graders. The word error rate (WER), which is the percentage of words that the model transcribes incorrectly, can help us figure out what's going on. For models trained on adult speech, WER is 8% for adult speakers, but on our evaluation set of 9th-grade speech, WER reaches up to 56%. In other words, the ASR gets it wrong more than half of the time! Models trained on elementary school kids' speech achieve a WER of about 9% when tested on kids of the same age, but for 9th graders it jumps to a whopping 46%. This shows that models trained on one age group do not generalize well to a different age group.
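For readers curious how the WER figures above are computed: WER is the word-level edit distance between the reference transcript and the ASR output (counting substitutions, deletions, and insertions), divided by the number of words in the reference. A minimal sketch in Python (our illustration, not iSAT's actual evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference, computed via
    word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Two of six reference words are dropped -> WER = 2/6 ≈ 33%
print(wer("the cat sat on the mat", "the cat sat mat"))
```

Note that WER can exceed 100% when the ASR inserts many extra words, which is why error rates like the 56% above are plausible even though they sound extreme.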
We can make improvements by starting with adult models and using a process called "fine-tuning," where a model goes through additional training on different data to adapt it to a new domain. Fine-tuning on elementary school kids' speech did improve the WER slightly, to 41%. To address the scarcity of training data for specific age groups, we are working on new techniques to adapt models to different age groups using very small amounts of age-specific data.
Second, there is concern within the AI community about "accuracy biases" in AI systems that can disadvantage certain demographic groups, such as non-white speakers. As a team, we have often discussed bias in AI models, identifying places where AI could be affected by bias and how we can mitigate it. In some of our recent work, we found that a popular ASR tool, on which we base automatic feedback for tutors, is 24% less accurate for Black speakers than for white speakers because the acoustics of their voices are not as well understood by the AI. Although we do not have access to the data used to train this model, one likely reason for the accuracy bias is that the model was not shown enough speech from Black speakers during training. If an AI can't "hear" individuals accurately, then this has consequences for its ability to provide helpful feedback! We used fine-tuning to reduce the accuracy gap between Black and white tutors by around a third, and also improved the ASR accuracy for both groups of tutors. But from just these two examples, it is clear that there is still a lot more work to do to overcome these bias issues!
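To make the arithmetic behind statements like "24% less accurate" and "reduced the gap by around a third" concrete, here is a small sketch. The WER values below are made-up illustrative numbers, not the actual figures from our study; only the relative-gap percentages mirror the text:

```python
def relative_wer_gap(wer_group_a: float, wer_group_b: float) -> float:
    """How much worse group A's WER is than group B's, as a fraction
    of group B's WER (e.g. 0.24 means 24% less accurate)."""
    return (wer_group_a - wer_group_b) / wer_group_b

# Hypothetical WERs chosen only to illustrate the calculation:
# before fine-tuning, a 24% relative gap...
gap_before = relative_wer_gap(0.31, 0.25)   # -> 0.24
# ...after fine-tuning, both groups improve AND the gap shrinks
# by about a third (24% -> 16%).
gap_after = relative_wer_gap(0.232, 0.20)   # -> 0.16
print(gap_before, gap_after)
```

The key point the sketch illustrates: fine-tuning can lower the error rate for everyone while also narrowing the relative gap between groups, rather than trading one group's accuracy for another's.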