Trying out Whisper large-v3 on Google Colab

November 20, 2023, 19:16

Whisper large-v3 is a speech recognition model: it transcribes spoken words into text.

This time, I ran it on Google Colab, using the page below as a reference.

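The code here is a Python program for speech recognition. It first upgrades pip and installs the required packages (transformers from its GitHub repository, accelerate, and datasets with the audio extra), then sets up OpenAI's whisper-large-v3 model using the torch and transformers libraries.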

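It imports the torch and transformers libraries.
It selects the device (GPU or CPU) and the data type: float16 on a GPU, float32 on a CPU.
It loads the whisper-large-v3 model and moves it to the selected device.
It sets up the processor and the pipeline for speech recognition.
It loads the validation split of the distil-whisper/librispeech_long dataset.
It takes the first sample of the dataset, runs it through the speech recognition pipeline, and prints the result.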

The purpose of this program is to extract text from a given audio sample.

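# Upgrade pip, then install transformers (from source), accelerate, and datasets with audio support
!pip install --upgrade pip
!pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


# Use the GPU with float16 when available; otherwise fall back to the CPU with float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

# Load the model weights and move the model to the selected device
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the feature extractor
processor = AutoProcessor.from_pretrained(model_id)

# Build the speech recognition pipeline; long audio is processed in 30-second chunks
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Load the validation split and take the audio of the first sample
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# Transcribe the sample and print the recognized text
result = pipe(sample)
print(result["text"])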

Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upguards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, Next man!

Execution result (the transcription printed above)
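The code above only prints result["text"], but because the pipeline is created with return_timestamps=True, the result should also carry a "chunks" list whose entries have a "timestamp" pair (start and end, in seconds) and a "text" field. The following is a minimal sketch of how those chunks could be printed with their time ranges. The helper names format_seconds and format_chunks and the example_result dictionary are hypothetical, introduced only for illustration; example_result is hand-written and is not real model output.

```python
from typing import List, Optional


def format_seconds(seconds: Optional[float]) -> str:
    """Render seconds as MM:SS.ss; a missing timestamp is shown as '??:??'."""
    if seconds is None:
        return "??:??"
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def format_chunks(result: dict) -> List[str]:
    """Turn the 'chunks' entries of a pipeline result into printable lines.

    Assumes each chunk looks like {"timestamp": (start, end), "text": "..."}.
    Returns an empty list when the result has no 'chunks' key.
    """
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        time_range = f"{format_seconds(start)} - {format_seconds(end)}"
        lines.append(f"[{time_range}] {chunk['text'].strip()}")
    return lines


# Hypothetical, hand-written result used only to demonstrate the helpers.
example_result = {
    "text": " First sentence. Second sentence.",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " First sentence."},
        {"timestamp": (1.5, None), "text": " Second sentence."},
    ],
}

for line in format_chunks(example_result):
    print(line)
```

In the notebook, the same helper would be applied to the real output, for example with format_chunks(result) right after result = pipe(sample). The None handling is there because the end time of the final chunk may be missing in some cases.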

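My impression is that it is easy to use, since it ran on Google Colab without any trouble. It is also highly versatile, so I expect it to see wider use going forward.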
