Using the latest Whisper model to transcribe speech
Published
December 26, 2023
Whisper is a high-quality speech-to-text model (Radford et al. 2022) that was originally released in 2022. Since then, two further versions have been released, each improving accuracy.
I’m going to quickly evaluate the quality of the v3 model, which was released two months ago.
Example Code
The model card for Whisper v3 is very good and includes some example code. We can start by comparing it to the original Whisper. There is an example utterance available, and I will find another, longer utterance to test against.
Example Utterance
To start with, I want to be able to transcribe this machine-generated speech:
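The code that loads this sample is folded in the post; judging by the cached-dataset message below, it was presumably something like this sketch (the split name is an assumption):

```python
from datasets import load_dataset
from IPython.display import Audio

# hypothetical reconstruction: the dummy LibriSpeech split used in the Whisper model cards
dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = dataset[0]["audio"]

# play the utterance in the notebook
Audio(sample["array"], rate=sample["sampling_rate"])
```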
Found cached dataset librispeech_asr_dummy (/home/matthew/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b)
Listening to this I would transcribe this as:
Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel
I’m not really sure what this is from. It’s machine generated and there is no background noise, so this test represents ideal conditions.
It would be good to have a more realistic utterance.
Code
from IPython.display import Audio

london_audio = "london.flac"
Audio(london_audio)
This is from a video where someone walks around London asking people how many languages they speak (source). I think this is more realistic as there is some background noise and they interrupt each other. My transcription of this would be:
Speaker 1: You’re from the states?
Speaker 2: Yes
Speaker 1: Are you
Speaker 2: I’ve lived here for about 9 years
Speaker 1: 9 years, incredible, how many languages do you speak?
Speaker 2: Just one
The “are you” and “I’ve lived” utterances overlap, so it will be interesting to see how well Whisper transcribes this.
Whisper V1
There is example code for using this on the Whisper large model card. I’m actually going to use the Hugging Face pipeline instead, as it can handle filenames, which will make processing the London utterance a lot easier. Quite neat really.
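The cell that defines transcribe_v1 is folded; a minimal sketch of how it might look with the pipeline API (the checkpoint and device choices are assumptions):

```python
from transformers import pipeline

# assumed setup: the original Whisper large checkpoint on a GPU
transcriber_v1 = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large",
    device=0,  # assumes a CUDA device is available
)

def transcribe_v1(audio_file: str) -> dict:
    # the pipeline decodes the audio file itself, so a plain filename works
    return transcriber_v1(audio_file)
```

Running it over the LibriSpeech sample gives: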
' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.'
Code
transcribe_v1(london_audio)["text"]
" Yes. I've lived here for about nine years. Just one."
It transcribes the machine-generated speech easily. Interestingly, it transcribes only the second speaker in the London conversation. I wonder if there is a flag for this?
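The chunked output below comes from asking the pipeline for timestamps, which is a single flag (a sketch, assuming the transcriber_v1 helper above):

```python
# request segment timestamps so each stretch of speech is reported separately
transcriber_v1(london_audio, return_timestamps=True)
```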
{'text': ' you from the states argue about nine years nine years credible honey languages to speak just one',
'chunks': [{'timestamp': (0.0, 1.9), 'text': ' you from the states'},
{'timestamp': (1.9, 3.46), 'text': ' argue about'},
{'timestamp': (3.46, 7.52),
'text': ' nine years nine years credible honey languages to speak just one'}]}
With the chunked output we can separate the two speakers, but the quality of the transcription has dramatically decreased. Let’s see how well Whisper v3 handles this.
Whisper V3
We can just change the name of the model to use the latest version! Let’s see how the transcription changes.
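The cell is folded again, but the change is presumably just the checkpoint name; a sketch:

```python
# same pipeline, pointed at the v3 checkpoint
transcriber_v3 = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0,
)
transcriber_v3(london_audio, return_timestamps=True)
```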
{'text': " You're from the States? Yes. I've lived here for about nine years. Nine years. Incredible. How many languages do you speak? Just one. Oh, yeah, he's great.",
'chunks': [{'timestamp': (0.0, 0.6), 'text': " You're from the States?"},
{'timestamp': (0.94, 1.22), 'text': ' Yes.'},
{'timestamp': (1.84, 4.0), 'text': " I've lived here for about nine years."},
{'timestamp': (4.3, 4.92), 'text': ' Nine years.'},
{'timestamp': (5.2, 5.6), 'text': ' Incredible.'},
{'timestamp': (5.84, 6.82), 'text': ' How many languages do you speak?'},
{'timestamp': (6.98, 7.5), 'text': ' Just one.'},
{'timestamp': (7.5, 8.0), 'text': " Oh, yeah, he's great."}]}
Whisper v3 has transcribed the two speakers almost perfectly; only the overlapping “are you” is missing. It has even tried to transcribe some of the background noise, although I am not confident that transcription is correct.