Deploying Text-to-Speech and Speech-to-Text Models in Apolo
In this tutorial, we'll walk through deploying an audio-based machine learning model service called "Speaches" onto the Apolo platform and demonstrate its Speech-to-Text and Text-to-Speech capabilities.
Speaches is an OpenAI API-compatible server that supports streaming transcription, translation, and speech generation, powered by models such as Faster Whisper and Kokoro. We will use a slightly modified, custom Docker image for this deployment.
Open your local terminal.
Create a directory for the demo called `speaches-demo` by running the following commands in your terminal:
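For example (any equivalent commands that create the folder and its `.apolo` subdirectory will work):

```bash
# Create the demo folder and the .apolo subdirectory that will hold live.yaml
mkdir -p speaches-demo/.apolo
cd speaches-demo
```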
Create a file called `live.yaml` inside the `speaches-demo/.apolo` directory with the following content. We will be using Ollama to run an OpenAI API-compatible LLM server for the chat functionality in Speaches.
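A minimal sketch of what this file can look like is shown below. The Ollama image, volume layout, environment variable name, and placeholder endpoint are illustrative assumptions; check the apolo-flow and Speaches documentation for the exact schema and settings, and adjust the preset to what your cluster offers:

```yaml
kind: live
title: speaches-demo

volumes:
  hf_cache:
    remote: storage:speaches/hf-hub-cache
    mount: /home/ubuntu/.cache/huggingface/hub
  ollama_models:
    remote: storage:speaches/ollama
    mount: /root/.ollama

jobs:
  ollama:
    # OpenAI API-compatible LLM server for the chat functionality.
    # A chat model (e.g. gemma3:4b) still needs to be pulled into Ollama separately.
    image: ollama/ollama
    preset: a100x1
    http_port: 11434
    detach: true
    volumes:
      - ${{ volumes.ollama_models.ref_rw }}

  speaches:
    image: ghcr.io/neuro-inc/speaches:sha-662eef8-cuda-12.4.1
    preset: a100x1
    http_port: 8000
    http_auth: false
    detach: true
    volumes:
      - ${{ volumes.hf_cache.ref_rw }}
    env:
      # Hypothetical variable name and placeholder value: point Speaches at the
      # Ollama job's endpoint; check the Speaches docs for the exact setting.
      CHAT_COMPLETION_BASE_URL: http://<ollama-job-hostname>:11434/v1
```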
Run the following commands to start Speaches:
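```bash
# Start the Ollama job first, then the Speaches job
apolo-flow run ollama
apolo-flow run speaches
```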
Understanding the Commands
apolo-flow run ollama
checks your `live.yaml` file for a job called `ollama`, creates any needed volumes, and starts that job
apolo-flow run speaches
checks your `live.yaml` file for a job called `speaches`, creates any needed volumes, and starts that job
In `live.yaml`, we define:
the hardware resources used by each job, via the `preset` key. In this case, a preset containing one NVIDIA A100 GPU. Some models are small and do not require a GPU. Make sure to check the resources available in your cluster and use the correct preset name; you can get a list of all available presets by running `apolo config show` in your terminal.
a dependency between the two jobs, by passing the Ollama endpoint to Speaches as an environment variable under the `env:` key (see the `env` section in the `live.yaml` sketch above).
Alternatively, if you do not want to test the chat functionality (which requires an LLM), you can use the `apolo run` command to start the service as a job on the Apolo platform directly, without needing a `live.yaml` file or an Ollama server. Copy and paste the following command into your terminal to start Speaches as a job:
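```bash
apolo run --name speaches \
  --preset a100x1 \
  --volume storage:speaches/hf-hub-cache:/home/ubuntu/.cache/huggingface/hub \
  --http-port 8000 \
  --no-http-auth \
  ghcr.io/neuro-inc/speaches:sha-662eef8-cuda-12.4.1
```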
Understanding the command:
apolo run
: Initiates a job run on Apolo.
--name speaches
: Assigns the name "speaches" to our running job.
--preset a100x1
: Specifies the hardware resources to use; in this case, a preset containing one NVIDIA A100 GPU. Some models are small and do not require a GPU. Make sure to check the resources available in your cluster to use the correct preset name. You can get a list of all available presets by running `apolo config show` in your terminal.
--volume storage:speaches/hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
: Mounts an Apolo storage volume named speaches/hf-hub-cache
(which you might need to create if it doesn't exist; see the commands after this list) to the container's Hugging Face cache directory. This allows downloaded models to be persisted and reused across job runs, speeding up startup times.
--http-port 8000
: Exposes port 8000 inside the container for HTTP traffic.
--no-http-auth
: Disables Apolo's default HTTP authentication for easier access during this demo.
ghcr.io/neuro-inc/speaches:sha-662eef8-cuda-12.4.1
: Specifies the custom Docker image to run.
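If you are unsure which preset to use, or the storage volume does not exist yet, the following commands may help. `apolo config show` is the preset listing command referenced above, while `apolo mkdir` is an assumption based on the Apolo CLI's storage commands; verify it with `apolo --help` if it differs:

```bash
# List the presets (and other configuration) available in your current cluster
apolo config show

# Create the storage folder for the Hugging Face cache if it does not exist yet
apolo mkdir -p storage:speaches/hf-hub-cache
```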
Press Enter to execute the command. Apolo will provision the resources and start the job.
Watch the terminal output. It will show the job status transitioning from pending Creating -> pending Scheduling -> pending ContainerCreating -> running.
Once running, Apolo prints a ✓ Http URL line. This is the public web address for your deployed Speaches service.
Copy the Http URL provided in the terminal output.
Paste the URL into your web browser and navigate to it. This opens the "Speaches Playground" UI.
Click on the "Speech-to-Text" tab.
Click the microphone icon (🎙️) below the "Drop Audio Here" area.
Click the "Record" button.
Speak a sentence into your microphone (e.g., "Hi, I am a machine learning engineer who works at Apolo. How are you doing?").
Click the "Stop" button.
From the "Model" dropdown, select a speech recognition model (e.g., `systran/faster-whisper-tiny`). The first time you select a model, it might be downloaded, which takes a moment.
Ensure the "Task" dropdown is set to `transcribe`.
Click the "Generate" button.
Observe the transcribed text appearing in the "Textbox" below.
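Beyond the Playground UI, the same transcription can be requested programmatically through the OpenAI-compatible `/v1/audio/transcriptions` endpoint. The snippet below is an illustrative sketch only; the `sample.wav` file and the placeholder URL are assumptions:

```python
import requests

# Replace with the Http URL of your running Speaches job (no trailing slash)
base_url = "https://<your-speaches-http-url>"

# Send a local audio file for transcription with the same model used in the UI
with open("sample.wav", "rb") as audio_file:
    response = requests.post(
        f"{base_url}/v1/audio/transcriptions",
        files={"file": audio_file},
        data={"model": "systran/faster-whisper-tiny"},
    )

response.raise_for_status()
print(response.json()["text"])  # the transcribed text
```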
Click on the "Text-to-Speech" tab.
The default model (`hexgrad/Kokoro-82M`) needs to be downloaded first. Click the "Download Model and Voices" button and wait for the processing to complete (this may take several seconds).
Once downloaded, additional options appear.
Select a "Voice" from the dropdown (e.g., `af_nicole`).
Leave the default "Input Text" (or enter your own).
Click the "Generate Speech" button.
Wait for the processing to complete.
An audio player will appear at the bottom. Click the play button (▶️) to listen to the generated speech.
You can also download the generated audio file using the download icon (⬇️).
This section demonstrates how to interact with the chatbot functionality using both text and voice input, receiving both text and synthesized speech output.
Scenario 1: Using Text Input
Click on the "Audio Chat" tab (this should be the default tab shown).
Observe that a chat model (e.g., `gemma3:4b`) is selected in the "Chat Model" dropdown. This is the model served by the Ollama job started earlier.
Ensure the "Stream" checkbox is checked so that responses appear word-by-word.
Locate the text input box labeled "Enter message or upload file...".
Type a message into the text box (e.g., "hello, how are you doing?").
Click the send icon (➡️).
Wait a few moments. Observe the chatbot's text response appearing incrementally in the "Chatbot" area below.
Once the text response is complete, observe an audio player appearing below the text, containing the synthesized speech of the chatbot's answer.
(Optional) Click the play button (▶️) on the audio player to hear the response.
Scenario 2: Using Voice Input
Ensure you are on the "Audio Chat" tab.
Click the microphone icon (🎙️) located to the left of the text input box.
Click the "Record" button that appears.
Speak your question or statement into your microphone (e.g., "Hi. Can you tell me a little bit about MLOps?").
Click the "Stop" button when you are finished speaking.
Click the send icon (➡️).
Wait for the processing to finish. Your spoken audio is first transcribed internally (this step is not shown), then sent to the chat model.
Observe the chatbot's detailed text response appearing in the "Chatbot" area.
Once the text response is complete, observe an audio player appearing below the text, containing the synthesized speech of the chatbot's answer.
(Optional) Click the play button (▶️) on the audio player to hear the response.
Besides the UI, you can interact with the Speaches service directly via its API.
Method A: Direct HTTP Request
Open your code editor (like VS Code) and create a Python script called `tts-request.py`.
Paste the following code into this file:
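The snippet below is a minimal sketch using the `requests` library; the model and voice names are the ones used earlier in this tutorial, so adjust them to whatever is available on your server:

```python
import requests

# Replace with the Http URL of your running Speaches job (no trailing slash)
base_url = "https://<your-speaches-http-url>"

# Request speech synthesis from the OpenAI-compatible /v1/audio/speech endpoint
response = requests.post(
    f"{base_url}/v1/audio/speech",
    json={
        "model": "hexgrad/Kokoro-82M",
        "voice": "af_nicole",
        "input": "Hello, world!",
        "response_format": "mp3",
    },
)
response.raise_for_status()

# Save the returned audio bytes to a file
with open("output.mp3", "wb") as f:
    f.write(response.content)
```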
Remember to modify the `base_url` variable to the URL of your running Speaches server.
Run the script from your terminal: `python tts-request.py`.
Open the generated `output.mp3` file to hear the speech ("Hello, world!").
Method B: Using the OpenAI SDK
Since Speaches is OpenAI API-compatible, you can use the official `openai` Python library.
Create another Python script called `tts-openai.py`.
Copy the following code into your file:
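Again, a minimal sketch; the model and voice mirror the earlier example, and depending on your `openai` library version you may need the streaming-response variant (`client.audio.speech.with_streaming_response.create(...)`) instead of `write_to_file`:

```python
from openai import OpenAI

# Point the client at your Speaches deployment; note the /v1 suffix.
# The API key only needs to be a non-empty string.
client = OpenAI(
    base_url="https://<your-speaches-http-url>/v1",
    api_key="not-needed",
)

response = client.audio.speech.create(
    model="hexgrad/Kokoro-82M",
    voice="af_nicole",
    input="Hello, world!",
)

# Save the generated audio to a file
response.write_to_file("output-openai.mp3")
```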
Initialize the client, crucially setting the `base_url` to your Apolo Speaches URL with `/v1` appended.
The `api_key` used by the OpenAI SDK just needs to be a non-empty string; you can keep the placeholder value from the example.
Run the script: `python tts-openai.py`.
Open the generated `output-openai.mp3` file to hear the speech ("Hello, world!" spoken with the Nicole voice).
This tutorial demonstrated how to deploy the Speaches audio service on Apolo using the Apolo CLI, interact with it via its web UI for Speech-to-Text and Text-to-Speech, and finally, how to call its API programmatically using both direct HTTP requests and the OpenAI SDK.