AIGC for Beginners (6): Whisper API | xiaojing's personal blog

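Whisper is OpenAI's large speech-to-text model. The API provides two speech-to-text endpoints, transcriptions and translations, both based on the state-of-the-art open-source large-v2 Whisper model. They can be used to:

Transcribe audio into whatever language the audio is in.
Translate and transcribe the audio into English.

File uploads are currently limited to 25 MB, and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.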
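
The size and file-type limits can be checked locally before uploading. The following is a minimal sketch, not part of the OpenAI SDK: the helper name `check_audio_file` and the constants are made up for illustration, and reading "25 MB" as 25 * 1024 * 1024 bytes is an assumption about how the limit is counted.

```python
import os

# Limits taken from the text; the byte interpretation of "25 MB" is an assumption.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}


def check_audio_file(path):
    """Return a list of problems likely to make the upload fail (empty if none)."""
    problems = []
    # Compare the extension case-insensitively and without the leading dot.
    extension = os.path.splitext(path)[1].lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_UPLOAD_BYTES} byte limit")
    return problems
```

An empty list means the file passes both checks; this says nothing about whether the audio content itself is valid.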

Transcription

The transcriptions API takes an audio file as input and returns text. It supports multiple input and output file formats.

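from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Open the audio in binary mode and send it to the transcription endpoint
with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcription.text)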
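
To illustrate the output formats, here is a hedged sketch that requests subtitles instead of the default JSON. `response_format` is a parameter of the transcription endpoint; the wrapper name `transcribe_to_srt` is hypothetical, and the sketch assumes the SDK returns a plain string for the `srt` format.

```python
def transcribe_to_srt(client, audio_path, srt_path):
    """Transcribe audio_path and save the result as SRT subtitles (hypothetical helper).

    client is expected to be an openai.OpenAI instance.
    """
    with open(audio_path, "rb") as audio_file:
        subtitles = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            # Other documented values include "json", "text", "verbose_json" and "vtt".
            response_format="srt",
        )
    # Assumption: for "srt" the SDK returns the subtitle text as a str.
    with open(srt_path, "w", encoding="utf-8") as out:
        out.write(subtitles)
    return srt_path
```

Passing the client in as an argument keeps the helper independent of how the API key is configured.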

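Translation

The translations API takes an audio file in any supported language as input and transcribes it into English, translating where necessary. It differs from the transcriptions endpoint in that the output is not in the original input language but is English text.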

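from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Send German audio to the translation endpoint; the returned text is English
with open("/path/to/file/german.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )
print(translation.text)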

The code above translates a German recording into English. English is currently the only supported target language.

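Improving reliability

One of the most common challenges with Whisper is that the model often fails to recognize uncommon words or acronyms. There are two techniques for improving Whisper's reliability in these cases:

Using the prompt parameter

The first method is to use the optional prompt parameter to pass a dictionary of correct spellings. Because Whisper was not trained with instruction-following techniques, it behaves more like a base GPT model, so the prompt should contain the terms themselves rather than instructions. Keep in mind that Whisper only considers the first 244 tokens of the prompt.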

from openai import OpenAI

client = OpenAI()

# Pass the correctly spelled product names as the prompt
with open("/path/to/file/speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",
        prompt="ZyntriQix, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array, OrbitalLink Seven, DigiFractal Matrix, PULSE, RAPT, B.R.I.C.K., Q.U.A.R.T.Z., F.L.I.N.T.",
    )
# With response_format="text" the SDK returns a plain string, so there is no .text attribute
print(transcription)
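
Because only a limited number of prompt tokens is considered, a long glossary has to be trimmed before it is sent. A rough sketch: `build_glossary_prompt` is a hypothetical helper, and the default token estimate (about four characters per token) is a crude assumption, not Whisper's real tokenizer. Pass your own `count_tokens` function if you need an accurate count.

```python
def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def build_glossary_prompt(terms, max_tokens=244, count_tokens=estimate_tokens):
    """Join terms into a comma-separated prompt, stopping before max_tokens is exceeded.

    Terms are added in order, so put the most important spellings first.
    """
    kept = []
    for term in terms:
        candidate = ", ".join(kept + [term])
        if count_tokens(candidate) > max_tokens:
            break  # adding this term would go over the budget
        kept.append(term)
    return ", ".join(kept)
```

The returned string can be passed directly as the `prompt` argument of the transcription call.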

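Post-processing with GPT

The second method is a post-processing step with a GPT model.

We first give GPT its instructions through the system_prompt variable. As with the prompt parameter earlier, we can list the company and product names.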

system_prompt = "You are a helpful assistant for the company ZyntriQix. Your task is to correct any spelling discrepancies in the transcribed text. Make sure that the names of the following products are spelled correctly: ZyntriQix, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array, OrbitalLink Seven, DigiFractal Matrix, PULSE, RAPT, B.R.I.C.K., Q.U.A.R.T.Z., F.L.I.N.T. Only add necessary punctuation such as periods, commas, and capitalization, and use only the context provided."

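def transcribe(audio_filepath, prompt):
    # Helper not shown in the original post; assumed to wrap the transcription call used earlier
    with open(audio_filepath, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            prompt=prompt,
        )
    return transcription.text

def generate_corrected_transcript(temperature, system_prompt, audio_file):
    # Ask GPT to fix the spelling in the raw Whisper transcript
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": transcribe(audio_file, "")
            }
        ]
    )
    # The original returned completion.choices[...], but the variable is named response
    return response.choices[0].message.content

# fake_company_filepath is a placeholder: set it to the path of your own audio file
corrected_text = generate_corrected_transcript(0, system_prompt, fake_company_filepath)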

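If you try this on your own audio files, you will see that GPT corrects many of the misspellings in the transcript. Thanks to its larger context window, this method may scale better than Whisper's prompt parameter, and it is more reliable because GPT can be instructed and guided in ways that Whisper cannot.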

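Open-source Whisper

Besides the OpenAI API, Whisper is also available as an open-source, general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.

Install it with: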

pip install -U openai-whisper

You also need the ffmpeg command-line tool. For installation on Windows, see https://phoenixnap.com/kb/ffmpeg-windows.

Five model sizes are provided: tiny, base, small, medium, and large.

Example code:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
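
The open-source model can also translate and report the detected language, matching its multitask description. A small sketch, assuming `model` is the object returned by `whisper.load_model(...)`; the function name `transcribe_and_translate` is made up for this example.

```python
def transcribe_and_translate(model, audio_path):
    """Run the same audio through both tasks and collect the results (hypothetical helper)."""
    # Default task: transcribe in the spoken language, which is detected automatically.
    original = model.transcribe(audio_path)
    # task="translate" asks the model for English output instead.
    english = model.transcribe(audio_path, task="translate")
    return {
        "language": original["language"],  # detected language code
        "text": original["text"],
        "english": english["text"],
    }
```

With a loaded model, the call would look like `transcribe_and_translate(model, "audio.mp3")`. Note that the audio is decoded twice, so this takes roughly twice as long as a single transcription.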

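If ffmpeg is not installed, the program above fails with a file-not-found error. The message is easy to misread as a missing audio file, but the file that cannot be found is the ffmpeg executable.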
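
To avoid that confusing error, you can check for ffmpeg before calling Whisper. A minimal sketch using the standard library's `shutil.which`; the function names `find_ffmpeg` and `require_ffmpeg` are hypothetical.

```python
import shutil


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path of the executable found on PATH, or None if it is missing."""
    return shutil.which(executable)


def require_ffmpeg():
    """Raise a clear error early instead of a confusing one from inside Whisper."""
    path = find_ffmpeg()
    if path is None:
        raise RuntimeError("ffmpeg was not found on PATH; install it before calling Whisper")
    return path
```

Calling `require_ffmpeg()` once at program start is enough, since the PATH does not normally change while the script runs.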