Performing real-time Speech-to-Text using AssemblyAI API in Python

Introduction

In one of my latest articles we explored how to perform offline speech recognition with the AssemblyAI API and Python. In other words, we uploaded the desired audio file to a hosting service and then used the transcript endpoint of the API to perform speech-to-text.

In today’s guide we will showcase how to perform real-time speech-to-text using the real-time transcription feature of the AssemblyAI API, which lets us transcribe audio streams in real time with high accuracy.

Let’s get started!

Installing PyAudio and Websockets

In order to build real-time speech recognition we need a tool that will let us record audio. PyAudio is a Python library that provides bindings for PortAudio, the cross-platform audio I/O library. Using this library we can play or record audio in real time on pretty much any platform, including macOS, Linux and Windows.

First, we need to install portaudio. On macOS you can do so using Homebrew:

brew install portaudio

and then install PyAudio from PyPI:

pip install pyaudio

If you are on Windows, you can install PyAudio through a wheel file you can find here, based on your Python version.

Additionally, we’ll need to install websockets:

pip install websockets

If you want to follow along with this tutorial, all you need is an API key, which you can get by signing up for an AssemblyAI account. Once you do so, your key should be visible in your Account section. Additionally, you’ll need to upgrade your account (go to Billing to do so) in order to access premium features such as real-time transcription.
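Rather than hardcoding the key in your script, you may prefer to read it from an environment variable. A minimal sketch — the variable name ASSEMBLYAI_API_KEY below is just a convention of this example, not something the API requires:

```python
import os

# Read the API key from an environment variable so it doesn't end up
# in version control; fall back to a placeholder if the variable is unset.
API_KEY = os.environ.get('ASSEMBLYAI_API_KEY', '<your AssemblyAI Key goes here>')

if API_KEY == '<your AssemblyAI Key goes here>':
    print('Warning: using placeholder API key')
```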

Real-time Speech-to-Text using AssemblyAI API

AssemblyAI offers a Speech-to-Text API that is built using advanced Artificial Intelligence methods and facilitates transcription of both video and audio files. In today’s guide we are going to use this API to perform speech recognition in real time!

Now the first thing we need to do is open a stream using PyAudio by specifying a few parameters such as the frames per buffer, sample rate, format and number of channels. Our stream will look like the one shown in the code snippet below:

import pyaudio

p = pyaudio.PyAudio()

# Open an input stream that reads 16-bit mono audio at 16 kHz
audio_stream = p.open(
    frames_per_buffer=3200,
    rate=16000,
    format=pyaudio.paInt16,
    channels=1,
    input=True,
)


Opening audio stream with PyAudio — Source: Author

The above code snippet will open an audio stream that receives the input from our microphone.

Now that we’ve opened the stream, we need to pass its data to AssemblyAI in real time over a websocket connection in order to perform speech recognition. To do so, we will define an asynchronous function that opens a websocket so that we can send and receive data at the same time.

Therefore, we need to define two inner asynchronous functions — one for reading chunks of audio from the microphone and sending them over the connection, and a second one for receiving the transcription results.
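The concurrency pattern itself can be sketched with plain asyncio, independently of any audio or websocket code — two coroutines run side by side under asyncio.gather, which is exactly how the sender and receiver will be scheduled later:

```python
import asyncio

# Two stand-in coroutines: in the real program one reads microphone
# chunks and sends them, the other receives transcripts.
async def send_data():
    for _ in range(3):
        await asyncio.sleep(0.01)  # pretend to send a chunk
    return 'sent'

async def receive_data():
    for _ in range(3):
        await asyncio.sleep(0.01)  # pretend to receive a transcript
    return 'received'

async def main():
    # gather schedules both coroutines concurrently and collects their results
    return await asyncio.gather(send_data(), receive_data())

results = asyncio.run(main())
print(results)  # ['sent', 'received']
```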

Here’s the first method for sending messages over the websocket connection:

import json
import base64
import asyncio

async def send_data():
    """
    Asynchronous function used for sending data
    """
    while True:
        try:
            data = audio_stream.read(FRAMES_PER_BUFFER)
            data = base64.b64encode(data).decode('utf-8')
            await ws_connection.send(json.dumps({'audio_data': str(data)}))
        except Exception as e:
            print(f'Something went wrong: {e}')
            break
        await asyncio.sleep(0.5)
    return True


Sending data over websocket connections — Source: Author
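The payload built on each iteration is just a JSON object with the base64-encoded chunk under the 'audio_data' key. A quick standard-library round trip (using a chunk of silence as a stand-in for microphone data) shows the encoding:

```python
import json
import base64

# 3200 frames of 16-bit mono silence, the same size send_data reads per call
chunk = b'\x00\x00' * 3200

# Encode the raw bytes as base64 text and wrap them in the JSON message
payload = json.dumps({'audio_data': base64.b64encode(chunk).decode('utf-8')})

# Decoding the message recovers the original audio bytes exactly
decoded = base64.b64decode(json.loads(payload)['audio_data'])
assert decoded == chunk
```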

And here’s the second method used for receiving data over the websocket connection:

import json

async def receive_data():
    """
    Asynchronous function used for receiving data
    """
    while True:
        try:
            received_msg = await ws_connection.recv()
            print(json.loads(received_msg)['text'])
        except Exception as e:
            print(f'Something went wrong: {e}')
            break


Receiving data over websocket connections — Source: Author

Note that in the methods above we are not handling any specific exceptions, but you may wish to handle different exceptions and error codes appropriately as required by your use case. For more details regarding error conditions you can refer to the relevant section of AssemblyAI’s official documentation, which defines the closing and status codes.
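As a sketch of what more granular handling could look like: the websockets library’s ConnectionClosed exceptions carry a close code, while other exceptions (e.g. from PyAudio) do not, so a helper can branch on its presence. The specific codes below (4001 for not authorized, 4008 for an expired session) are examples drawn from AssemblyAI’s documented close codes — check the documentation for the full list:

```python
def describe_error(e):
    # websockets' ConnectionClosed exceptions expose a close code;
    # other exceptions don't, so read it defensively.
    code = getattr(e, 'code', None)
    if code == 4001:
        return 'Not authorized: check your API key'
    if code == 4008:
        return 'Session expired'
    if code is not None:
        return f'Connection closed with code {code}'
    return f'Unexpected error: {e}'

class FakeClosed(Exception):
    """Stand-in for a ConnectionClosed exception carrying a close code."""
    def __init__(self, code):
        self.code = code

print(describe_error(FakeClosed(4001)))    # Not authorized: check your API key
print(describe_error(ValueError('boom')))  # Unexpected error: boom
```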

Now the main asynchronous function that makes use of the two aforementioned functions is defined below.

import json
import base64
import asyncio
import websockets

SAMPLE_RATE = 16000
FRAMES_PER_BUFFER = 3200
API_KEY = '<your AssemblyAI Key goes here>'
ASSEMBLYAI_ENDPOINT = f'wss://api.assemblyai.com/v2/realtime/ws?sample_rate={SAMPLE_RATE}'

async def speech_to_text():
    """
    Asynchronous function used to perform real-time speech-to-text using AssemblyAI API
    """
    async with websockets.connect(
        ASSEMBLYAI_ENDPOINT,
        ping_interval=5,
        ping_timeout=20,
        extra_headers=(('Authorization', API_KEY),),
    ) as ws_connection:
        await asyncio.sleep(0.5)
        # Consume the initial session message before streaming audio
        await ws_connection.recv()
        print('Websocket connection initialised')

        async def send_data():
            """
            Asynchronous function used for sending data
            """
            while True:
                try:
                    data = audio_stream.read(FRAMES_PER_BUFFER)
                    data = base64.b64encode(data).decode('utf-8')
                    await ws_connection.send(json.dumps({'audio_data': str(data)}))
                except Exception as e:
                    print(f'Something went wrong: {e}')
                    break
                await asyncio.sleep(0.5)
            return True

        async def receive_data():
            """
            Asynchronous function used for receiving data
            """
            while True:
                try:
                    received_msg = await ws_connection.recv()
                    print(json.loads(received_msg)['text'])
                except Exception as e:
                    print(f'Something went wrong: {e}')
                    break

        data_sent, data_received = await asyncio.gather(send_data(), receive_data())


Sending and receiving data over websocket connections — Source: Author

Full Code

The Gist below contains the full code that we are going to use in order to perform real-time speech-to-text using AssemblyAI’s API.

import json
import base64
import asyncio
import pyaudio
import websockets

SAMPLE_RATE = 16000
FRAMES_PER_BUFFER = 3200
API_KEY = '<your AssemblyAI Key goes here>'
ASSEMBLYAI_ENDPOINT = f'wss://api.assemblyai.com/v2/realtime/ws?sample_rate={SAMPLE_RATE}'

p = pyaudio.PyAudio()

# Open an input stream that reads 16-bit mono audio from the microphone
audio_stream = p.open(
    frames_per_buffer=FRAMES_PER_BUFFER,
    rate=SAMPLE_RATE,
    format=pyaudio.paInt16,
    channels=1,
    input=True,
)

async def speech_to_text():
    """
    Asynchronous function used to perform real-time speech-to-text using AssemblyAI API
    """
    async with websockets.connect(
        ASSEMBLYAI_ENDPOINT,
        ping_interval=5,
        ping_timeout=20,
        extra_headers=(('Authorization', API_KEY),),
    ) as ws_connection:
        await asyncio.sleep(0.5)
        # Consume the initial session message before streaming audio
        await ws_connection.recv()
        print('Websocket connection initialised')

        async def send_data():
            """
            Asynchronous function used for sending data
            """
            while True:
                try:
                    data = audio_stream.read(FRAMES_PER_BUFFER)
                    data = base64.b64encode(data).decode('utf-8')
                    await ws_connection.send(json.dumps({'audio_data': str(data)}))
                except Exception as e:
                    print(f'Something went wrong: {e}')
                    break
                await asyncio.sleep(0.5)
            return True

        async def receive_data():
            """
            Asynchronous function used for receiving data
            """
            while True:
                try:
                    received_msg = await ws_connection.recv()
                    print(json.loads(received_msg)['text'])
                except Exception as e:
                    print(f'Something went wrong: {e}')
                    break

        data_sent, data_received = await asyncio.gather(send_data(), receive_data())


Full code used for performing speech-to-text using AssemblyAI API — Source: Author

Demonstration

We now have complete code capable of opening an audio stream, sending the input from our microphone, and receiving the response from the AssemblyAI API asynchronously.

In order to run the code all we need to do is pass our async function to asyncio.run():

asyncio.run(speech_to_text())

Now you should be able to speak through your microphone and transcribe the streamed audio.

For the purposes of this tutorial, I’ve uploaded the audio I streamed from my computer in order to perform speech-to-text in real time.

The output of running the program we’ve just written on the above audio stream is shown below:

Websocket connection initialised
You know
You know demons
You know demons on TV
You know, demons on TV like that
You know, demons on TV like that And for
You know, demons on TV like that And for people to
You know, demons on TV like that And for people to expose
You know, demons on TV like that And for people to expose themselves to
You know, demons on TV like that And for people to expose themselves to being
You know, demons on TV like that And for people to expose themselves to being rejected
You know, demons on TV like that And for people to expose themselves to being rejected on TV or
You know, demons on TV like that And for people to expose themselves to being rejected on TV or humiliated
You know, demons on TV like that And for people to expose themselves to being rejected on TV or humiliated by fear
You know, demons on TV like that And for people to expose themselves to being rejected on TV or humiliated by fear factor or


Example output — Source: Author
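Each growing line above is a partial transcript: AssemblyAI’s real-time responses include a message_type field ('PartialTranscript' or 'FinalTranscript') alongside the text, so you can filter for final results only. The messages below are hypothetical examples mimicking that shape:

```python
import json

# Hypothetical messages shaped like AssemblyAI's real-time responses
messages = [
    '{"message_type": "PartialTranscript", "text": "You know"}',
    '{"message_type": "PartialTranscript", "text": "You know demons"}',
    '{"message_type": "FinalTranscript", "text": "You know, demons on TV like that."}',
]

# Keep only the final transcripts to avoid printing repeated partials
finals = [m['text'] for m in map(json.loads, messages)
          if m['message_type'] == 'FinalTranscript']
print(finals)  # ['You know, demons on TV like that.']
```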

Final Thoughts

In today’s article we explored how to perform speech recognition in real time by opening an audio stream and a websocket connection to interact with the AssemblyAI API. Note that we covered only a small subset of the features the AssemblyAI API provides. Make sure to check their full list here.


Author

Machine Learning Engineer | Python Developer | https://www.buymeacoffee.com/gmyrianthous