A three-dimensional, computerized human face that converses with hearing-impaired children using state-of-the-art speech recognition and speech generation technologies is showing students how to understand and produce spoken language.
Developed with a three-year, $1.8 million National Science Foundation grant, the computer project could transform the way language is taught to hearing-impaired children, said University of Colorado at Boulder Professor Ron Cole.
The face, a 3-D "tutor" dubbed "Baldi," helps students learn vocabulary and produce words accurately, said Cole, the project's director. Baldi's 3-D animation includes movements of the lips, teeth, tongue and jaw, producing accurate, visible speech synchronized with either synthetic or naturally recorded audible speech.
In addition, the system's curriculum development software allows teachers to customize class work, said Cole, director of the Center for Spoken Language Research at CU's Cognitive Science Institute. Teachers and students can use simple computer tools known as "wizards" to create various applications and work at their own pace.
Students can periodically review class work and homework lessons to improve their vocabulary, reading and spelling. The project also lets students study how subtle facial movements produce particular sounds, said Cole. "There is no question that kids are benefiting from it," he said.
The 3-D animation is based on work by University of California, Santa Cruz psychology Professor Dominic Massaro. The tongue movements are based on data collected by researchers at Baltimore's Johns Hopkins University.
The facial animation, speech recognition and speech synthesis systems reside in a software package known as the CSLU Toolkit, designed at the Oregon Graduate Institute under Cole's direction.
"The project began with a vision in the mid-1990s to develop free software for spoken language systems and their underlying technologies," said Cole. "We wanted to give researchers the means to improve and share language tools that enhance learning and increase access to information."
To create Baldi's speech recognition capabilities, the researchers compiled a database of speech from more than 1,000 children. The samples were used to train mathematical models that recognize the fine details of children's speech. In addition, the animated speech Baldi produces is accurate enough to be understood by users who read lips, said Cole.
The pilot study began in grades 6 through 12 at the Tucker-Maxon Oral School in Portland, Ore., said Cole. The Center for Spoken Language Understanding at the private Oregon Graduate Institute, the University of California, Santa Cruz's Perceptual Science Laboratory and the University of Edinburgh in Scotland also contributed to the research.
At the Tucker-Maxon school, Baldi is used by profoundly deaf children whose hearing is enhanced through amplification or electrical stimulation of the ear's cochlea, said Cole. Teachers use a toolkit, available on the Web at no cost to researchers and educators, that allows them and their students to design their own multimedia lessons.
"The students report that working with Baldi is one of their favorite activities. The teachers and speech therapist report that both learning and language skills are improving dramatically," Cole said.
"Activities in the classroom are more efficient, since students can work simultaneously on different computers, with each receiving individualized instruction while the teacher observes and interacts with selected students."
Project results may be incorporated into animated conversational faces like Baldi for applications beyond hearing impairment, such as teaching new languages or diagnosing and treating speech disorders, he said.
Cole recently received a five-year, $4 million grant from NSF's Information Technology Research Initiative. The new project will develop interactive books and virtual tutors for children with reading disabilities. The next generation of Baldi will use the latest computer technologies to interpret facial expressions, integrating feedback from audible and visible speech cues.