It’s been almost a month since my last update about Marti so I wanted to give everyone a sneak peek into what I’ve been working on and what’s coming up next.
In my last analysis, I found that the highest-priority module to update in Marti is “hear”. Hear is the speech-to-text component that takes in a user’s speech and converts it to text for the “think” module to respond to. To improve the latency and fluency of the hear module, I’ve been running a benchmark of over a dozen automatic speech recognition (speech-to-text/transcription) models, tools, services, and APIs.
I’ve spent the last month writing hundreds and hundreds of lines of spaghetti code to programmatically access these different options for the hear module and quantitatively compare them against each other. However, last week, a friend of mine shared an option that is completely derailing my plans to use speech-to-text for this particular project.
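To give a sense of what that comparison code does (heavily simplified, and with the service wrappers left hypothetical, since each real API has its own client), the core loop just runs every candidate over the same test clips and scores average latency and word error rate using the jiwer library:

```python
import time

from jiwer import wer  # word error rate metric; pip install jiwer

# Hypothetical: each candidate ASR option wrapped in a common
# transcribe(audio_path) -> str function, keyed by name.
from my_asr_wrappers import CANDIDATES

# (audio file, reference transcript) pairs from the benchmark set
TEST_SET = [
    ("samples/greeting.wav", "hello how are you"),
    ("samples/weather.wav", "what is the weather like today"),
]

for name, transcribe in CANDIDATES.items():
    total_wer = 0.0
    total_latency = 0.0
    for audio_path, reference in TEST_SET:
        start = time.perf_counter()
        hypothesis = transcribe(audio_path)
        total_latency += time.perf_counter() - start
        total_wer += wer(reference, hypothesis)
    n = len(TEST_SET)
    print(f"{name}: avg WER {total_wer / n:.2%}, avg latency {total_latency / n:.2f}s")
```

The real harness is messier (hence “spaghetti code”), but that’s the shape of the measurement.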
Since the coding work is done for it, I’m still going to go ahead with writing up and publishing my benchmark, in the hope that it will be useful to somebody else on another project. However, I’m going to make a minor pivot to explore an exciting alternative technology for the hear module: speech-to-phoneme models.
Speech-to-phoneme models
In short, speech-to-phoneme models take audio that contains human speech and output phonemes. Phonemes are the base units of spoken language, representing the distinct sounds of the language. For instance, the word “car” in (rhotic) American English is composed of a phoneme for the /k/ sound, one for the /ɑ/ vowel, and one for the /ɹ/ sound: /kɑɹ/.
This makes these models distinct from speech-to-text models in one critical way: speech-to-phoneme models pick up on how words are pronounced, not just what words are said. This means that systems that use phoneme-based models can correct the pronunciation of language learners.
Since pronunciation is such an important part of language acquisition, and since language learners are so prone to mispronouncing words, the speech-to-phoneme approach unlocks a lot of value: it helps learners understand how to speak a language, not just the rules of how the language works.
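To make that concrete, here’s a toy sketch (my own illustration, not Marti’s actual design) of how phoneme output enables pronunciation feedback: align the phonemes the learner should have produced against the ones the model recognized, and report the differences.

```python
from difflib import SequenceMatcher

def pronunciation_feedback(expected, actual):
    """Align expected vs. recognized phoneme sequences and report the differences."""
    for op, e1, e2, a1, a2 in SequenceMatcher(a=expected, b=actual).get_opcodes():
        if op == "replace":
            print(f"said {' '.join(actual[a1:a2])} instead of {' '.join(expected[e1:e2])}")
        elif op == "delete":
            print(f"dropped {' '.join(expected[e1:e2])}")
        elif op == "insert":
            print(f"added {' '.join(actual[a1:a2])}")

# Toy example: a learner saying “car” with the wrong vowel
pronunciation_feedback(expected=["k", "ɑ", "ɹ"], actual=["k", "æ", "ɹ"])
# -> said æ instead of ɑ
```

A speech-to-text model would likely have transcribed both utterances as “car” and the mispronunciation would be invisible; at the phoneme level it stands out immediately.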
To take advantage of this approach, I’m going to explore the space of speech-to-phoneme models and understand my options. I’ve sketched out a rough architecture (at the top of the post) of how I think I could use a speech-to-phoneme model in the context of Marti and will also explore alternative ideas for how to incorporate these types of models.
The first model I was introduced to, and the one that looks most promising from a cursory search, is Wav2Vec2Phoneme, based on Facebook AI Research’s Wav2Vec 2.0 model, which is famous for revolutionizing self-supervised learning for speech tasks. In addition to Wav2Vec2Phoneme, the original Wav2Vec was previously used for phoneme analysis, though Wav2Vec 2.0 likely brought many improvements. After the Wav2Vec2Phoneme publication, unaffiliated researchers also published a meta-analysis of phoneme-based ASR, which may prove valuable.
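For anyone curious what using Wav2Vec2Phoneme looks like in practice, here’s a minimal sketch based on the published facebook/wav2vec2-lv-60-espeak-cv-ft checkpoint and Hugging Face’s transformers library. I haven’t settled on this integration yet, so treat it as illustrative:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Published Wav2Vec2Phoneme checkpoint, fine-tuned to emit eSpeak IPA phonemes
MODEL_ID = "facebook/wav2vec2-lv-60-espeak-cv-ft"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

def speech_to_phonemes(speech):
    """speech: 1-D float array of 16 kHz mono audio (e.g. librosa.load(path, sr=16000))."""
    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # Returns a space-separated IPA string, e.g. "k ɑ ɹ"
    return processor.batch_decode(predicted_ids)[0]
```

The phoneme string this produces is exactly the kind of input the alignment sketch above would consume.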
Where from here
As I mentioned in the first section, my next step is to analyze the data from the speech-to-text benchmark and write up an analysis of the results. Even though I don’t expect to use speech-to-text in Marti now that I’ve found the speech-to-phoneme approach, I still hope that my results and strategy will prove useful to other programmers who want to incorporate speech-to-text into their projects.
After publishing that, I will dive into a few areas of exploration in parallel:
- As discussed above, I want to better understand the space of options for speech-to-phoneme models
- Since I started working on and writing about Marti, several companies (most significantly Duolingo, but also Memrise, Quazel, and Speak) have released similar products. I will analyze their offerings, as well as any others I find, to identify ways to improve Marti based on the strengths and weaknesses of other approaches
- There have been huge advances in hosted LLMs in the last few weeks: OpenAI released GPT-4 (available through ChatGPT), Anthropic launched Claude (I got early access through a friend), and Google opened up the waitlist to Bard (maybe someone reading this at Google will give me access 😉). I’m still not sure whether to use a hosted LLM or one that I tune myself, but it sure is an exciting time to be trying out different LLM chatbots!