Using the latest Whisper model to transcribe speech
Published
December 26, 2023
Whisper is a high-quality speech-to-text model (Radford et al. 2022) that was originally released in 2022. Since then, two further versions have been released, each improving accuracy.
I’m going to quickly evaluate the quality of the v3 model, which was released two months ago.
Example Code
The model card for Whisper v3 is very good and includes some example code. We can start by comparing it to the original Whisper. There is an example utterance available, and I will find another, longer utterance to test against.
Example Utterance
To start with, I want to be able to transcribe this machine-generated speech:
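The code that loads this sample is folded in the post; judging by the cached-dataset message below, it was presumably something like this sketch (the split name is an assumption):

```python
from datasets import load_dataset
from IPython.display import Audio

# hypothetical reconstruction: the dummy LibriSpeech split used in the Whisper model cards
dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = dataset[0]["audio"]

# play the utterance in the notebook
Audio(sample["array"], rate=sample["sampling_rate"])
```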
Found cached dataset librispeech_asr_dummy (/home/matthew/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b)
Listening to this I would transcribe this as:
Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel
I’m not really sure what this is from. It’s machine generated and there is no background noise, so this test represents ideal conditions.
It would be good to have a more realistic utterance.
Code
from IPython.display import Audio

london_audio = "london.flac"
Audio(london_audio)
This is from a video where someone walks around London asking people how many languages they speak (source). I think this is more realistic as there is some background noise and they interrupt each other. My transcription of this would be:
Speaker 1: You’re from the states?
Speaker 2: Yes
Speaker 1: Are you
Speaker 2: I’ve lived here for about 9 years
Speaker 1: 9 years, incredible, how many languages do you speak?
Speaker 2: Just one
The “are you” and “I’ve lived” utterances overlap, so it will be interesting to see how well Whisper transcribes this.
Whisper V1
There is example code for using this on the Whisper large model card. I’m actually going to use the Hugging Face pipeline instead, as it can handle filenames, which will make processing the London utterance a lot easier. Quite neat really.
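The cell that defines transcribe_v1 is folded; a minimal sketch of how it might look with the pipeline API (the checkpoint and device choices are assumptions):

```python
from transformers import pipeline

# assumed setup: the original Whisper large checkpoint on a GPU
transcriber_v1 = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large",
    device=0,  # assumes a CUDA device is available
)

def transcribe_v1(audio_file: str) -> dict:
    # the pipeline decodes the audio file itself, so a plain filename works
    return transcriber_v1(audio_file)
```

Running it over the LibriSpeech sample gives: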
' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.'
Code
transcribe_v1(london_audio)["text"]
" Yes. I've lived here for about nine years. Just one."
It transcribes the machine-generated speech easily. Interestingly, it transcribes only the second speaker in the London conversation. I wonder if there is a flag for this?
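The chunked output below comes from asking the pipeline for timestamps, which is a single flag (a sketch, assuming the transcriber_v1 helper above):

```python
# request segment timestamps so each stretch of speech is reported separately
transcriber_v1(london_audio, return_timestamps=True)
```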
{'text': ' you from the states argue about nine years nine years credible honey languages to speak just one',
'chunks': [{'timestamp': (0.0, 1.9), 'text': ' you from the states'},
{'timestamp': (1.9, 3.46), 'text': ' argue about'},
{'timestamp': (3.46, 7.52),
'text': ' nine years nine years credible honey languages to speak just one'}]}
With the chunked output we can separate the two speakers, but the quality of the transcription has dramatically decreased. Let’s see how well Whisper v3 handles this.
Whisper V3
We can just change the name of the model to use the latest version! Let’s see how the transcription changes.
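The cell is folded again, but the change is presumably just the checkpoint name; a sketch:

```python
# same pipeline, pointed at the v3 checkpoint
transcriber_v3 = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0,
)
transcriber_v3(london_audio, return_timestamps=True)
```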
{'text': " You're from the States? Yes. I've lived here for about nine years. Nine years. Incredible. How many languages do you speak? Just one. Oh, yeah, he's great.",
'chunks': [{'timestamp': (0.0, 0.6), 'text': " You're from the States?"},
{'timestamp': (0.94, 1.22), 'text': ' Yes.'},
{'timestamp': (1.84, 4.0), 'text': " I've lived here for about nine years."},
{'timestamp': (4.3, 4.92), 'text': ' Nine years.'},
{'timestamp': (5.2, 5.6), 'text': ' Incredible.'},
{'timestamp': (5.84, 6.82), 'text': ' How many languages do you speak?'},
{'timestamp': (6.98, 7.5), 'text': ' Just one.'},
{'timestamp': (7.5, 8.0), 'text': " Oh, yeah, he's great."}]}
Whisper v3 has transcribed the two speakers almost perfectly; only the overlapping “are you” is missing. It has even tried to transcribe some of the background noise, although I am not confident that transcription is correct.