A Comprehensive* Speech-to-Text Benchmark**

Danny D. Leybzon

Apr 5, 2023

I wrote 300+ lines of spaghetti code and all I got was this lousy table

4 Comments

Dylan Black

Apr 5, 2023

Are speech-to-phoneme models language-agnostic? As in, if I’m a native Spanish speaker, will it “understand” my accent?

Danny D. Leybzon

Apr 5, 2023

Good question! I'm still trying to wrap my head around speech-to-phoneme (STP) models and how phonemes get modeled. I was heads down on this benchmark until yesterday, and only started implementing the STP model after finishing this post.

My understanding is that phonemes are language-specific, especially at the boundaries between vowel sounds (e.g. have you heard a person with a Slavic accent say "beach" so that it sounds like "bitch"?) but also for some consonants. The IPA, though, creates something like a shared representation of phonemes across languages.

I'm not sure what you mean by the second question. Do you mean a native Spanish speaker speaking English, or a native Spanish speaker speaking Spanish? In either case, the model should be able to detect the phonemes you're saying in the language you're speaking, and you can then create a dictionary that maps those phonemes to words in that language. Spanish is an interesting case because native Spanish speakers have myriad accents that actually map to distinct phonemes ("Como te llamas" can be "Como te \j\amas" or "Como te \y\amas"). A phoneme-to-word dictionary needs to be able to model that many-to-one relationship between pronunciations and the word. Optimally, I guess a system would figure out which "Spanish" you're trying to learn (Argentinian Spanish? European Spanish? Mexican Spanish?) and advise you on both the vocabulary and the pronunciation in that version of the language. For now, I'm just going to treat all vocabulary and pronunciation as valid :)
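A minimal sketch of that many-to-one idea, in Python. Everything here is an assumption made for illustration: the `PHONEMES_TO_WORD` table, the `lookup_word` helper, and the phoneme labels (which simply mirror the informal "j" and "y" notation above rather than real IPA symbols) are hypothetical, not part of any STP library.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical pronunciation table: several phoneme sequences may map to
# the same word (many-to-one). The labels are informal placeholders that
# mirror the comment above, not real IPA transcriptions.
PHONEMES_TO_WORD: Dict[Tuple[str, ...], str] = {
    ("j", "a", "m", "a", "s"): "llamas",  # one regional pronunciation
    ("y", "a", "m", "a", "s"): "llamas",  # another regional pronunciation
    ("k", "o", "m", "o"): "como",
    ("t", "e"): "te",
}


def lookup_word(phonemes: Sequence[str]) -> Optional[str]:
    """Return the word for a phoneme sequence, or None if it is unknown."""
    return PHONEMES_TO_WORD.get(tuple(phonemes))


# Both pronunciations resolve to the same word.
word_a = lookup_word(["j", "a", "m", "a", "s"])
word_b = lookup_word(["y", "a", "m", "a", "s"])
```

Treating every listed variant as valid, as described above, just means adding one entry per accepted pronunciation; a dialect-aware system could instead tag each entry with the variety of Spanish it belongs to.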

Dylan Black

Apr 5, 2023

Hmm, let me see if I can ask a better question:

When I map speech to phonemes to words, is my accent going to materially affect the transcription error rate? For example, if I’m benchmarking a speech-to-text system, would a Spanish speaker benefit from a different encoding system? In other words, how robust is the speech-phoneme-word mapping to pronunciation variation?

Danny D. Leybzon

Apr 6, 2023

Yes! And that's exactly why STP is a superior choice for language instruction. By using phonemes, the system can identify when a speaker is using non-standard phonemes to try to express a word in the language that they're speaking. This is useful for giving coaching and feedback to users about how they're pronouncing words in the language that they're learning.
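One way this kind of feedback could be sketched in Python is to align the phonemes a speaker produced against an expected pronunciation and report the differences. This is only an illustration under stated assumptions: the `pronunciation_feedback` helper and the phoneme lists are hypothetical, and the alignment uses the standard library's `difflib.SequenceMatcher` rather than whatever an actual STP pipeline would use.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple

Issue = Tuple[str, List[str], List[str]]


def pronunciation_feedback(expected: Sequence[str], spoken: Sequence[str]) -> List[Issue]:
    """List the places where the spoken phonemes differ from the expected ones.

    Each issue is (operation, expected_phonemes, spoken_phonemes), where
    operation is "replace", "delete", or "insert".
    """
    expected, spoken = list(expected), list(spoken)
    matcher = SequenceMatcher(a=expected, b=spoken, autojunk=False)
    issues: List[Issue] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the mismatching spans
            issues.append((tag, expected[i1:i2], spoken[j1:j2]))
    return issues


# Illustrative (assumed) transcriptions for the "beach" example from earlier
# in the thread: the long vowel is replaced by a short one.
issues = pronunciation_feedback(["b", "iː", "tʃ"], ["b", "ɪ", "tʃ"])
```

Here `issues` contains a single "replace" entry pairing the expected vowel with the one that was spoken, which is the kind of signal a coaching tool could turn into feedback.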

Marti AI
