Speech Recognition with Transformers and Gradio

Is it possible to make a better speech recognizer?
Published: April 9, 2022

I’ve previously attempted to make a chatbot that could respond to spoken conversation. While the chatbot part worked to a degree, the speech recognition was very poor, which made the whole experience of chatting frustrating.

Gradio has recently published a post about real-time speech recognition, so I’m going to try that out. Having now used it myself, I found the default setup quite poor; hopefully I can substitute in a different model and get better performance.

Copying the Gradio Blog Post

Let’s start by just copying the blog post and seeing how it does.

Code
from transformers import pipeline

speech_pipeline = pipeline("automatic-speech-recognition")
No model was supplied, defaulted to facebook/wav2vec2-base-960h (https://huggingface.co/facebook/wav2vec2-base-960h)
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Code
import gradio as gr

def transcribe(audio):
    # gradio passes the recording as a file path, which the pipeline can read
    text = speech_pipeline(audio)["text"]
    return text

speech_interface = gr.Interface(
    fn=transcribe, 
    inputs=gr.inputs.Audio(
        source="microphone",
        type="filepath"
    ), 
    outputs="text"
)
speech_interface.launch(share=True)

After some teething problems this is now working. I said “hello, hello, how are you?” and it heard HALLO HALLO HOW ARE YET. That is a word accuracy of 0.4, as only 2 of the 5 words are correct (or 0.8 if you count hallo as a match for hello).
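Out of interest, here is roughly how I computed that accuracy (a quick helper of my own, not something from the blog post); it compares the words position by position:

Code
def word_accuracy(reference: str, hypothesis: str) -> float:
    # compare the words position by position, matching the manual count above
    ref_words = reference.upper().split()
    hyp_words = hypothesis.upper().split()
    matches = sum(ref == hyp for ref, hyp in zip(ref_words, hyp_words))
    return matches / len(ref_words)

word_accuracy("HELLO HELLO HOW ARE YOU", "HALLO HALLO HOW ARE YET")  # 0.4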

The blog post has more ways to improve this though. Let’s try the streaming approach:

Code
import gradio as gr

def transcribe(audio, state=""):
    # transcribe the latest chunk and append it to the running transcript
    text = speech_pipeline(audio)["text"]
    state += text + " "
    # return it twice: once for the textbox, once to carry forward as state
    return state, state

speech_interface = gr.Interface(
    fn=transcribe, 
    inputs=[
        gr.inputs.Audio(source="microphone", type="filepath"), 
        "state"
    ],
    outputs=[
        "textbox",
        "state"
    ],
    live=True
)
speech_interface.launch(share=True)

This streaming approach works by recording short snippets which are then transcribed independently. It does not handle pauses in the speech well, nor utterances that cross chunk boundaries. My same utterance was now transcribed as HALLA EN HELLA I Y, and even with the suggested fix (adding a pause between phrases) it did not improve.
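One fix I can imagine (my own sketch, not from the blog post) is to keep the raw audio in the state instead of the text, and re-transcribe the whole buffer each time. That avoids losing words at chunk boundaries, at the cost of re-running the model over an ever-growing recording:

Code
import gradio as gr
import numpy as np

def transcribe(audio, state=None):
    sample_rate, data = audio
    if data.ndim > 1:
        data = data.mean(axis=1)  # collapse stereo to mono
    # append the new chunk to everything heard so far and re-transcribe it all
    state = data if state is None else np.concatenate([state, data])
    # note: this assumes the audio already matches the 16kHz that the model
    # expects - see the next section for the sample rate problem
    text = speech_pipeline(state)["text"]
    return text, state

speech_interface = gr.Interface(
    fn=transcribe,
    inputs=[
        gr.inputs.Audio(source="microphone", type="numpy"),
        "state"
    ],
    outputs=[
        "textbox",
        "state"
    ],
    live=True
)

I haven’t measured this properly, so treat it as a direction to explore rather than a tested fix.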

Sample Rate

One thing to consider is whether the sample rate of my microphone matches what the model expects. The model works with 16kHz audio while my microphone records at 48kHz, so if the data is not downsampled the model receives audio that sounds wrong to it, which could explain the poor performance.

We can check the sample rate by just asking Gradio to report it:

Code
import gradio as gr

def transcribe(audio):
    # with type="numpy" gradio passes a (sample_rate, data) tuple
    sample_rate, data = audio
    return sample_rate

speech_interface = gr.Interface(
    fn=transcribe, 
    inputs=gr.inputs.Audio(
        source="microphone",
        type="numpy"
    ), 
    outputs="text"
)
speech_interface.launch(share=True)

This confirms the problem: the reported sample rate is 48,000Hz. Since the speech pipeline accepts a numpy array, I can resample the audio myself.

Code
import gradio as gr
from scipy.signal import resample

def transcribe(audio):
    sample_rate, data = audio
    samples = data.shape[0]
    data = data.mean(axis=1)            # collapse stereo to mono
    data = resample(data, samples//3)   # 48kHz -> 16kHz is a third of the samples
    text = speech_pipeline(data)["text"]
    return text

speech_interface = gr.Interface(
    fn=transcribe, 
    inputs=gr.inputs.Audio(
        source="microphone",
        type="numpy"
    ), 
    outputs="text"
)
speech_interface.launch(share=True)

Resampling has worked and the output of the model is now HALLO HALLO HOW ARE YOU. It still struggles with background noise, so I consider this about as accurate as the first time I tried this. It’s likely that the file-based approach used in the blog post lets the pipeline resample the input itself.
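As an aside, the samples//3 above hard-codes the 48kHz to 16kHz ratio of my particular microphone. A slightly more general version (my own sketch, still using scipy) would derive the target length from whatever rate Gradio reports:

Code
from scipy.signal import resample

TARGET_RATE = 16_000  # the rate the wav2vec2 models expect

def to_model_rate(sample_rate, data):
    if data.ndim > 1:
        data = data.mean(axis=1)  # collapse stereo to mono
    if sample_rate == TARGET_RATE:
        return data
    # work out how many samples the clip should have at 16kHz
    target_samples = round(len(data) * TARGET_RATE / sample_rate)
    return resample(data, target_samples)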

Model Quality

The last time I was doing this I found great variation in the quality of different speech recognition models, so trying out some alternatives might be productive. We can review the models available on the Hugging Face website. The existing pipeline uses the default facebook/wav2vec2-base-960h model.

Code
from transformers import pipeline

speech_pipeline = pipeline("automatic-speech-recognition", model="facebook/hubert-large-ls960-ft")

This model has changed the prediction to HULLO HULLO HOW ARE YOU.

Code
from transformers import pipeline

speech_pipeline = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-large-960h-lv60-self")

This got HELLO HELLO HOW ARE YOU, which is the first perfect transcription. I’m going to try a longer utterance.

This time round it got CALL ME ISHMAEL SOME YEARS AGO NEVER MIND HOW LONG PRECISELY HAVING LITTLE OR NO MONEY IN MY PURSE AND NOTHING PARTICULAR TO INTEREST ME ON SHORE I THOUGHT I WOULD SAIL ABOUT A LITTLE AND SEE THE WATERY PART OF THE WORLD. This is a perfect transcription once again. I’m very happy with this now; if I were to update the house application I would use Gradio.
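To make this kind of comparison easier to repeat, a small harness (my own sketch; the list is just the checkpoints tried above) can run the same clip through each model:

Code
from transformers import pipeline

MODELS = [
    "facebook/wav2vec2-base-960h",
    "facebook/hubert-large-ls960-ft",
    "facebook/wav2vec2-large-960h-lv60-self",
]

def compare_models(audio_file: str) -> None:
    # transcribe the same recording with each model for a side by side view
    for name in MODELS:
        asr = pipeline("automatic-speech-recognition", model=name)
        print(f"{name}: {asr(audio_file)['text']}")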

Final Notes

Gradio is really good and I’ve used it a lot in the past. I’m pleased that they were acquired by Hugging Face, as that gives the tool some financial stability, and Gradio already worked really well with Hugging Face models.

I might rework the house conversation bot I made before to work with this.