
Imagine having a language tutor with whom you can converse from your phone or laptop at any time. A tutor who is a master of the language that you want to learn and personalizes their instruction to your ability. This is the goal of Marti AI.
Three motivations and a goal
Marti (a syllabic abbreviation of “maestro artificial”—“artificial teacher” in Spanish) grew out of a series of discussions that I had over the course of January 2023. I am extremely bullish on the power of AI to make personal tutoring more accessible (more on this in a later post) and wanted to dip my toe into the water with a hands-on project. Given the recent hype around large language models (LLMs) following the release of ChatGPT, I was excited to have an opportunity to play around with a technology that people have been raving about. I’d also been feeling the itch to build and work on something that I could obsess about in the shower, so Marti felt like an obvious choice.
In addition to gaining experience with AI for personal tutoring and exploring a proof of concept for this idea, I also wanted an opportunity to refresh and refine my Python fluency. A few months ago, prompted by a throwaway comment from my then-CEO, I asked one of the engineers at the company where I was working to interview me the way he interviews SWE candidates. Although I passed his relatively straightforward coding exercise, I was embarrassed by how much I struggled to remember syntactic details of Python that had once been second nature to me.
Plus, I just wanted something like Marti to be able to improve my Spanish while living outside of Spanish-speaking countries.
With these motivations bubbling up in conversations that I was having with friends, family, and colleagues, I started thinking more seriously about how I could work on a project that would check all three boxes. Eventually, I settled on a goal to use as a true north while experimenting and building Marti:
Create a bot that users can speak with to improve their language acquisition when they don’t have access to native speakers or teachers.
What’s out there
With my objective in mind, I started researching what was already available in the field that might match what I was looking to build. Given the popularity of ChatGPT and the myriad posts I saw across my LinkedIn and Twitter newsfeeds, I had assumed that lots of people had already hooked up GPT-3 to speech recognition and speech synthesis technologies to be able to communicate with it verbally instead of in a chat. I’ll share my analysis of the existing options in a later post, but I was surprised by the dearth of speech-to-text-to-LLM-to-text-to-speech projects on the web.
The one project that is most similar to what I was envisioning building is James Weaver’s (JavaFXpert) project “Talk With GPT-3”. James’s project is cool and even includes a talking head animation to go along with the dialogue, which was an idea that I’d had for Marti but am treating as aspirational.
There were a few things that differentiated James’s project from what I had in mind:
Most importantly, James focused on creating a UI for people to verbally communicate with GPT-3 and improved its accessibility by making it polyglottal. This was related to but different from what I had in mind: I wanted to create a tool specifically for learning a new language, not just a way to communicate with the LLM in a non-English language.
As a result, James did very little prompt engineering or model tuning aimed at making the conversational partner a good teacher. This makes sense, given that his goal was simply to make a polyglottal speech bot rather than a language teacher. The result is that “Talk With GPT-3” doesn’t do much to help students improve their foreign language skills.
James built “Talk With GPT-3” with JavaScript and didn’t offer a hosted version of it. I want to build my project in Python—not that this matters to end-users, who only care about the UX, rather than the internals—and, if there’s demand, offer a hosted version for people who don’t want to deal with setting up and hosting the project themselves
Once I was satisfied that no one else had already done the exact thing that I was setting out to do, I was ready to move on to planning my project.
Simple proof of concept
I have lofty and aspirational ideas for how to make Marti maximally cool and useful, but I wanted to start with the simplest possible prototype so that I could prove out the concept. I sketched out this simple diagram to capture the components of what Marti would need to be:

There are three basic modules that make up the bare minimum requirements for Marti: a speech recognition (also called speech-to-text) component for the user to speak to, a component to process the converted input and generate a response, and a speech synthesis (also called text-to-speech) module to vocalize the response to the user.
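The Listen → Think → Speak loop can be sketched as three pluggable components. This is a minimal sketch, not the prototype’s actual code: the `listen`, `think`, and `speak` callables are hypothetical stand-ins for whatever speech-to-text, LLM, and text-to-speech implementations get plugged in.

```python
from typing import Callable


def conversation_turn(
    listen: Callable[[], str],    # speech-to-text: capture audio, return a transcript
    think: Callable[[str], str],  # LLM: generate a tutor response from the transcript
    speak: Callable[[str], None],  # text-to-speech: vocalize the response
) -> str:
    """Run one turn of the Listen -> Think -> Speak loop and return the reply text."""
    transcript = listen()
    reply = think(transcript)
    speak(reply)
    return reply
```

In the prototype, `listen` might wrap a speech-recognition API and `think` a call to GPT-3; the point of the sketch is only that each module sits behind a narrow interface, so any one of them can be swapped out during refinement.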
In addition to these three MVP modules, my initial idea for Marti also included an option for hosting it (e.g. as a web app) and for generating a visual display for the user to look at (e.g. using a video synthesis tool). I am interested in exploring these other components for the system, though they aren’t as high in priority as refining the core user experience.
Refinement priorities
Now that I’ve completed the prototype (which I’ll release the source code for alongside a later blog post), I’m starting to think about how to refine the application. From my experiences exploring other, similar applications as well as testing my own, I came up with three priority areas for refinement:
Latency
Fluency
Intelligence
Latency
The first thing that I noticed about my application and others like it is the latency between when I speak or enter text and when I get a response. In my case, there are three modules, each of which contributes to the latency of the application. In my upcoming post on the prototype, I’ll share my analysis of how much each module contributes to the latency, and in future posts refining each module, I’ll emphasize latency as a UX KPI when making choices between different approaches.
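Measuring each module’s share of the latency only needs a thin timing wrapper. A minimal sketch (the module names in the usage comment are assumptions, not the prototype’s real function names):

```python
import time
from typing import Any, Callable


def timed(name: str, fn: Callable[..., Any], *args: Any) -> tuple[Any, float]:
    """Call fn, print its wall-clock duration, and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed * 1000:.1f} ms")
    return result, elapsed


# Hypothetical usage, wrapping each module of the loop:
#   transcript, t_listen = timed("listen", listen)
#   reply, t_think = timed("think", think, transcript)
#   _, t_speak = timed("speak", speak, reply)
```

Summing the three durations against the end-to-end time also reveals any overhead hiding between the modules.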
Fluency
In addition to latency, a major issue with speech-based systems is fluency. I’m using fluency to refer to the ability of a computer system to convert human speech into machine-readable text (i.e. speech recognition) and to produce speech that sounds fluent and natural to a human ear (i.e. speech synthesis). Computer scientists have made great strides in improving both aspects of fluency in computer systems and I’m looking forward to exploring and taking advantage of those advances. In future posts about refining the listening and speaking modules, I’ll emphasize fluency—composed of both the word error rate and my subjective feeling about how “natural” the synthesized voice sounds—when evaluating speech-to-text and text-to-speech options.
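The word-error-rate half of that fluency metric is well defined: it’s the word-level edit distance between a reference transcript and the recognizer’s hypothesis, divided by the number of reference words. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

So a recognizer that turns “hola como estas” into “hola somo estas” scores a WER of one third: one substitution over three reference words. The “natural voice” half of fluency, by contrast, stays subjective.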
Intelligence
Perhaps the least easily quantified area for refinement is in the intelligence of the system. This stems entirely from the “Thinking” module, where I’ve used GPT-3 so far and will explore prompt engineering as well as alternative algorithms (including self-hosted LLMs) to make Marti give more helpful responses. In my experiments with the prototype so far, I’ve noticed that sometimes the bot fails to correct errors in my Spanish, and other times it “corrects” me by telling me that I said something wrong and then telling me to say the exact same thing again.
Clearly, there’s room for improvement in the intelligence of the system; that’s what I’ll be refining when I experiment with different options for the “Thinking” module. To evaluate this area with something resembling objectivity, I’ll compose a benchmark to test how good the system is at correcting beginner, intermediate, and advanced language mistakes.
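The shape of such a benchmark could be as simple as a table of flawed utterances paired with the correction the tutor should surface, scored by how often the model’s response actually contains it. Everything below is illustrative: the sentences, levels, and `respond` callable are hypothetical placeholders, not a real test set or the prototype’s API.

```python
from typing import Callable

# Hypothetical benchmark cases: (level, student utterance, correction the tutor
# should surface). These two examples are illustrative, not a real test set.
BENCHMARK = [
    ("beginner", "Yo soy veinte años", "tengo"),        # ser/tener confusion
    ("intermediate", "Ayer yo voy al mercado", "fui"),  # wrong tense
]


def score_corrections(respond: Callable[[str], str]) -> float:
    """Fraction of benchmark utterances whose response contains the expected fix."""
    hits = sum(
        1
        for _, utterance, expected in BENCHMARK
        if expected in respond(utterance).lower()
    )
    return hits / len(BENCHMARK)
```

A substring check is a crude proxy (a response could mention “fui” without actually teaching anything), but it makes comparisons between prompts and models repeatable, which is the point.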
Why you should subscribe
In the coming posts, I’ll be talking about various aspects of building Marti. The posts that I’ve got already in the pipeline or planned are:
“Why this project”, where I’ll explain in greater depth why I think AI private tutoring for languages is a valuable thing to work on
A post where I’ll expand on the “What’s out there” section and share my analysis of similar projects in the space
A post where I’ll describe the prototype that I built in more depth, along with releasing the source code for it
A series of posts for refining each of the modules (Listening, Thinking, and Speaking) where I’ll explore and select the best option for each of them
And, if there’s interest in it, posts for exploring the optional modules (Visual Display and Deployment)
If you’re interested in following along and learning about this journey of building an AI language tutor, subscribe below: