Note: this is the first post of a two-part series about the prototype of Marti, an AI-powered language tutor. This post focuses on the present state of Marti, while the next post presents an analysis of the program's modules and a roadmap for future improvements.

A key part of my development philosophy for Marti is iteration. I am prone to spending too much time planning and strategizing at the cost of execution, and my projects can sometimes suffer because I am not willing to experiment and iterate toward better solutions. To challenge myself on this anti-pattern and improve my ability to create iteratively, I decided to start by creating the simplest possible version of Marti and then improve it piece by piece, rather than trying to build the perfect solution right away.
Since the most basic version of Marti breaks down into three independent modules, I'll share details of how I selected solutions for each of them. I'll also describe the main file, which I use to stitch the different modules together. If you want to skip ahead to the code and ignore my exposition about each module, you can check out the Marti repo on my GitHub.
Hear
The first module that I worked on is named "hear" and appears as "hear.py" on GitHub. Its goal is simple: translate speech that somebody says into their computer microphone into text that can be interpreted by the "think" component.
To learn about the performance of the hear module and where I want to take it from here, check out the corresponding section in the Marti v0: Future post.
Code
import speech_recognition as sr

def input():
    # Set up the recognizer and the default system microphone
    r = sr.Recognizer()
    mic = sr.Microphone()
    print("System Prompt: What do you want to say to Marti?")
    with mic as source:
        # Record a single phrase from the microphone
        audio = r.listen(source)
    # Transcribe the audio with the Google Speech Recognition API and return the text
    return r.recognize_google(audio)
Tool
To do this, I used the SpeechRecognition library in Python. SpeechRecognition is a simple but very useful library that does exactly what I was looking for: convert speech to text. It does this by acting as a wrapper around a number of different speech recognition APIs, including the default Google Speech Recognition, Google's Speech-to-Text, Microsoft Azure's Speech, CMU's Sphinx, OpenAI's Whisper, and more. The options from Google and Microsoft require internet access to send requests to their service APIs, while the latter two can be run entirely locally.
What makes SpeechRecognition so convenient for my use case is that I can quickly iterate and try out different speech recognition APIs, selecting the one that performs best on both latency and fluency. By putting a single wrapper around all of these different services, the library's author, Anthony Zhang, has made it easy to try different services and select the one that fits a given use case.
To keep things simple for the prototype, I defaulted to using the Google Speech Recognition API. This API doesn’t require providing an API key or installing any additional software, so it was the easiest option for this task.
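As a sketch of how easy that swap is, the same recorded audio can be handed to different recognizer methods; both methods below come from SpeechRecognition's documented API, though only the Google default is wired up in the prototype.

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    audio = r.listen(source)

# Default: Google Speech Recognition (no API key or extra install needed)
text = r.recognize_google(audio)

# Swapping backends is a one-line change, e.g. the fully local CMU Sphinx
# (requires installing pocketsphinx first):
# text = r.recognize_sphinx(audio)
print(text)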
Think
After converting the user’s speech into text, the next step is to generate a response to that text. I accomplish this with the “think” module (though astute readers will remember that I am skeptical about referring to LLMs as “thinking”), which you can check out on GitHub.
To learn about the performance of the think module and where I want to take it from here, check out the corresponding section in the Marti v0: Future post.
Code
import openai
import config

# Authenticate with the OpenAI API using locally stored credentials
openai.organization = config.openai_org
openai.api_key = config.openai_key

def response(input):
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt="You are a beginner Spanish teacher named Marti. You correct your students when they say something incorrectly in Spanish and otherwise respond conversationally. A student says to you: " + input,
        temperature=0.7,
        max_tokens=256
    )
    print("Marti's Output: " + response.choices[0].text)
    return response.choices[0].text
Note: I import a package here named "config". I store the OpenAI API keys that I generated in a file named "config.py", which I had GitHub ignore so that my keys aren't posted publicly, where they could rack up a large bill. You'll need to generate your own keys and create your own config file.
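For reference, the config file is nothing more than two module-level variables; the placeholder values below are illustrative, not real credentials.

# config.py -- listed in .gitignore so it never reaches GitHub
openai_org = "org-XXXXXXXXXXXXXXXX"  # your OpenAI organization ID
openai_key = "sk-XXXXXXXXXXXXXXXX"   # your OpenAI secret API key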
Tool
I opted to use the API for OpenAI's "text-davinci-003" model, the state of the art in their GPT-3 series and a close relative of the model that powers the (in)famous ChatGPT. OpenAI highlights that Davinci is the most capable GPT-3 model, though it is slower and more expensive than other models like Curie, Babbage, and Ada. It may be surprising to some, but Davinci actually works quite well on foreign languages, especially ones that are commonly used online and are therefore well represented in the model's training corpus.
The OpenAI API is straightforward, and I only needed a little bit of prompt engineering to get the prototype to a point where it felt publishable.
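Because the model is just a string parameter in the request, comparing Davinci against the cheaper models mentioned above is a one-line change. The sketch below assumes the same setup as think.py; "text-curie-001" was the Curie model's API name at the time of writing, and the sample prompt is my own.

import openai
import config

openai.organization = config.openai_org
openai.api_key = config.openai_key

# The same completion call as think.py, pointed at a cheaper, faster model
response = openai.Completion.create(
    engine="text-curie-001",  # one tier below Davinci in capability and cost
    prompt="You are a beginner Spanish teacher named Marti. A student says to you: Hola, ¿cómo estás?",
    temperature=0.7,
    max_tokens=256
)
print(response.choices[0].text)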
Speak
Finally, after a text response has been generated, that text needs to be converted into audible speech for the user to hear and respond to in turn. It turns out that there are a few options in Python both for generating speech and for playing it back, and it took some experimentation to find a combination that worked. You can see the full code for the speak.py file, including the StackOverflow answer that I ended up relying on, on GitHub.
To learn about the performance of the speak module and where I want to take it from here, check out the corresponding section in the Marti v0: Future post.
Code
from gtts import gTTS
from io import BytesIO
from pydub import AudioSegment
from pydub.playback import play

def output(response):
    # Generate Spanish speech with Google Translate's text-to-speech API
    tts = gTTS(text=response, lang='es')
    # Write the MP3 audio to an in-memory buffer instead of to disk
    fp = BytesIO()
    tts.write_to_fp(fp)
    fp.seek(0)
    # Load the buffer with pydub and play it through the speakers
    song = AudioSegment.from_file(fp, format="mp3")
    play(song)
    return None
Tool
I ended up relying on two different Python libraries for this module: gTTS (a wrapper around Google Translate’s text-to-speech API) to generate an audio file and pydub (which allows users to interact with FFmpeg) to play the audio file from the computer’s speakers.
The first library is especially useful because it allows us to generate speech without paying for a service like Google Cloud Platform’s Text-to-Speech API. I could even specify that the text was Spanish, which changed the voice from sounding like an American horrifically butchering Spanish to a very robotic Spanish teacher.
Another feature that I looked for, even in this prototype version, was the ability to keep the audio file in memory, rather than writing it to disk and reading it back. Many of the solutions that I found for playing back audio in Python were designed to read a file from disk, which would add latency compared to reading the audio directly from memory.
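To make the contrast concrete, here is a sketch of the disk-based route next to the in-memory route that speak.py uses; both save() and write_to_fp() are part of gTTS's documented API.

from io import BytesIO
from gtts import gTTS

tts = gTTS(text="Hola, ¿cómo estás?", lang='es')

# Disk-based route: write an MP3 to disk and read it back (extra I/O latency)
tts.save("marti_reply.mp3")

# In-memory route used by speak.py: keep the MP3 bytes in a buffer
fp = BytesIO()
tts.write_to_fp(fp)
fp.seek(0)  # rewind the buffer so it can be read from the beginning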
marti.py
The marti.py file isn’t exactly a module in its own right, but I figure it’s worth calling out for completeness. For those unfamiliar with Python, a main file is used as the entry point to software. When executing a multi-file Python program from the command line, designating a single main file makes it easy to control program execution.
import hear
import think
import speak

def main():
    # Loop until the user says "finish conversation"
    while True:
        input = hear.input()
        if "finish conversation" not in input:
            response = think.response(input)
            speak.output(response)
        else:
            speak.end_convo()
            break

if __name__ == "__main__":
    main()
In my case, my main file only contains a single main() function and an if statement to call it. Inside main() is an endlessly repeating loop that first calls the hear module and then checks whether the user has said "finish conversation". If so, the program calls a function in the speak module to vocalize exit text and breaks out of the loop. Otherwise, the output from the hear module is passed to the think module, and the output from that is passed to the speak module. As long as the user doesn't say "finish conversation" (or otherwise break the conversational loop, such as by exiting with Ctrl-C), the loop continues.
While there are likely improvements to be made in the organization and execution of this file, they will be incremental and minor tweaks. As such, I won’t dedicate text in the next post to them.
Conclusion
While the Marti prototype is relatively simple and straightforward, building this simplest possible version of the application I had in mind gave me an opportunity to start experimenting with a completed MVP immediately. This was invaluable because from here I can prioritize the most important problems.
To continue reading about Marti, my analysis of the prototype, and what I’ll be working on to improve it, check out the next post and subscribe below: