Fine-Tuning a Strong Language Model to Enable Classroom Speech Recognition
Postdoctoral researcher Viet Anh Trinh led a project within Strand 1 to develop a novel neural network architecture that can both recognize and generate speech. He has since moved on from iSAT to a role at Nvidia, where he is continuing his work on multimodal Large Language Models.
Understanding and processing speech in classrooms is difficult because it comes with its own set of challenges: background noise, people's unique speaking styles, changes in pitch, and differences in the content of speech. Kids also don't talk like adults, which means that existing speech recognition models don't work well for classroom speech, especially when there isn't a lot of labeled data available for model training. To tackle this, we use unsupervised machine learning, a type of machine learning where a computer learns patterns from data without being given explicit instructions or labeled examples. This reduces the need for large amounts of labeled training data while better capturing how kids speak and communicate. The approach holds great potential for future applications in education, healthcare, and more, enabling more inclusive AI systems tailored to younger users.
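To make the idea of learning from unlabeled speech concrete, here is a minimal, illustrative sketch (not the project's actual pipeline) of one common unsupervised step: clustering unlabeled acoustic feature vectors so that each frame receives a discrete pseudo-label, with no human transcription involved. The tiny k-means below and its toy 2-D "features" are assumptions for demonstration only.

```python
def kmeans(frames, k=2, iters=10):
    """Assign each feature vector a cluster ID learned from the data alone."""
    # Initialize centers with evenly spaced frames (deterministic for the demo).
    centers = [list(frames[i * (len(frames) - 1) // (k - 1)]) for i in range(k)]
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assign every frame to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((f - m) ** 2 for f, m in zip(fr, centers[c])))
            for fr in frames
        ]
        # Recompute each center as the mean of its assigned frames.
        for c in range(k):
            members = [fr for fr, lab in zip(frames, labels) if lab == c]
            if members:
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels

# Toy 2-D "acoustic features": two well-separated groups of frames.
frames = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
labels = kmeans(frames, k=2)
# Frames in the same acoustic group receive the same pseudo-label,
# which can then stand in for a human-provided transcription label.
```

In real systems the feature vectors would come from an acoustic model rather than toy points, but the principle is the same: structure discovered in the data substitutes for labels a human would otherwise have to provide.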
Our speech processing team has been working on a Discrete Multimodal Large Language Model (DMLM), which can flexibly translate data across modalities to perform various speech processing tasks. DMLM is one of the first discrete token-based decoder-only models; it can translate to and from text, speech, and images. Combining audio, images, and text helps the model better understand speech context. To improve its performance, we fine-tune a strong language model by blending unsupervised learning with multimodal data to advance speech recognition technology.
So far, we have found that our LLM-based approach, by harnessing multimodal inputs, can outperform state-of-the-art models of similar size. We achieve significantly more accurate speech recognition in both noisy and quiet environments, as well as clearer and more reliable understanding in real-world scenarios.
LLMs can help address the complexities of children's speech recognition. Looking ahead, we aim to expand this work by exploring multilingual capabilities (including English-to-Spanish speech translation) and visual information (including equations, slides, and images from lecture materials in video conference formats); leveraging lip movements in classroom settings could also further enhance ASR performance in noisy environments. All of these advancements will improve educational tools and enrich children's interactions with AI systems, fostering meaningful progress in inclusive and accessible technology.