Version: 1.1.3

Audio

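Spring AI provides two audio capabilities: speech-to-text (transcription) and text-to-speech (TTS). Transcription is based on the OpenAI Whisper model and is implemented for two providers, OpenAI and Azure OpenAI. Speech synthesis supports OpenAI and ElevenLabs and is called through the unified TextToSpeechModel interface.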


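1. Overview

Spring AI exposes two sets of APIs: AudioTranscriptionModel for speech-to-text, and the unified TextToSpeechModel / StreamingTextToSpeechModel for text-to-speech. Both give different providers a common calling convention.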


2. Speech-to-Text

2.1 Transcription Request

AudioTranscriptionPrompt wraps the audio resource (a Resource) together with the transcription options, and implements ModelRequest<Resource>. Supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, webm.

// Minimal constructor: audio file only
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(
        new ClassPathResource("audio/interview.mp3"));

// Constructor with options
AudioTranscriptionPrompt promptWithOptions = new AudioTranscriptionPrompt(
        new ClassPathResource("audio/meeting.mp3"),
        OpenAiAudioTranscriptionOptions.builder()
                .model("whisper-1")
                .language("zh")
                .temperature(0.2f)
                .responseFormat(TranscriptResponseFormat.VERBOSE_JSON)
                .build());
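
Because the request accepts any Resource, a file in an unsupported format is only rejected once the provider receives it. The sketch below shows one way to check a file name against the format list above before building the prompt. AudioFormatCheck and isSupported are hypothetical names used for illustration; they are not part of Spring AI, and the check looks only at the extension, not at the file contents.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Spring AI class): validates a file name
// against the audio formats listed in section 2.1.
final class AudioFormatCheck {

    // Extensions copied from the supported-format list in this section.
    private static final Set<String> SUPPORTED =
            Set.of("mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm");

    private AudioFormatCheck() {
    }

    /** Returns true if the file name ends with one of the supported extensions. */
    static boolean isSupported(String filename) {
        if (filename == null) {
            return false;
        }
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return false; // no extension present
        }
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(ext);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("audio/interview.mp3"));
        System.out.println(isSupported("notes.txt"));
    }
}
```

A controller could call such a check on the uploaded file name and return a 400 response early instead of forwarding the request.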

2.2 Transcription Response

AudioTranscriptionResponse response = model.call(prompt);

// Transcribed text
AudioTranscription result = response.getResult();
String text = result.getOutput();

// Response metadata and the full list of results
AudioTranscriptionResponseMetadata metadata = response.getMetadata();
List<AudioTranscription> allResults = response.getResults();

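2.3 Transcription Options

The portable options interface defines only the model selection.

public interface AudioTranscriptionOptions extends ModelOptions {
    String getModel();
}

Each provider extends this interface with its own options.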


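3. OpenAI Transcription Implementation

3.1 Using the Transcription Model

OpenAiAudioTranscriptionModel implements Model<AudioTranscriptionPrompt, AudioTranscriptionResponse> and also provides a convenience overload, call(Resource), so that an AudioTranscriptionPrompt does not have to be built by hand.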

3.2 Automatic Injection

application.yml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        transcription:
          options:
            model: whisper-1
            temperature: 0.7

@RestController
public class TranscriptionController {

    private final OpenAiAudioTranscriptionModel transcriptionModel;

    public TranscriptionController(OpenAiAudioTranscriptionModel transcriptionModel) {
        this.transcriptionModel = transcriptionModel;
    }

    @PostMapping("/transcribe")
    public String transcribe(@RequestParam("file") MultipartFile file) throws IOException {
        // Wrap the uploaded stream as a Resource and transcribe it
        Resource audioResource = new InputStreamResource(file.getInputStream());
        return transcriptionModel.call(audioResource);
    }
}

3.3 Basic Transcription

The call(Resource) convenience method performs a transcription in a single line.

Complete example: TranscriptionDemo.java
@Component
public class TranscriptionDemo implements CommandLineRunner {

    private final OpenAiAudioTranscriptionModel transcriptionModel;

    public TranscriptionDemo(OpenAiAudioTranscriptionModel transcriptionModel) {
        this.transcriptionModel = transcriptionModel;
    }

    @Override
    public void run(String... args) {
        Resource audio = new ClassPathResource("audio/meeting.mp3");
        String text = transcriptionModel.call(audio);
        System.out.println("Transcription: " + text);
    }
}

3.4 Transcription with Options

Complete example: TranscriptionOptionsDemo.java
import org.springframework.ai.openai.api.OpenAiAudioApi.TranscriptResponseFormat;
import org.springframework.ai.openai.api.OpenAiAudioApi.TranscriptionRequest.GranularityType;

@Component
public class TranscriptionOptionsDemo implements CommandLineRunner {

    private final OpenAiAudioTranscriptionModel transcriptionModel;

    public TranscriptionOptionsDemo(OpenAiAudioTranscriptionModel transcriptionModel) {
        this.transcriptionModel = transcriptionModel;
    }

    @Override
    public void run(String... args) {
        OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                .model("whisper-1")
                .language("zh")
                // Prompt written in the audio's language: "This is a recording of a technical meeting"
                .prompt("这是一段技术会议的录音")
                .temperature(0.2f)
                .responseFormat(TranscriptResponseFormat.VERBOSE_JSON)
                .granularityType(GranularityType.SEGMENT)
                .build();

        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(
                new ClassPathResource("audio/meeting.mp3"),
                options
        );

        AudioTranscriptionResponse response = transcriptionModel.call(prompt);
        System.out.println("Transcription: " + response.getResult().getOutput());
    }
}

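3.5 Transcription Option Parameters

| Method | Default | Description |
|---|---|---|
| model(String) | whisper-1 | Model name |
| language(String) | - | ISO-639-1 language code, e.g. zh, en |
| prompt(String) | - | Guiding prompt that helps the model with context and terminology |
| temperature(Float) | 0.7 | Sampling temperature, 0-1; lower values are more deterministic |
| responseFormat(TranscriptResponseFormat) | JSON | Response format, see the table below |
| granularityType(GranularityType) | - | Timestamp granularity (only takes effect with the VERBOSE_JSON format) |

TranscriptResponseFormat enum:

| Value | Description | Response structure |
|---|---|---|
| JSON | JSON format, text only | {"text": "..."} |
| TEXT | Plain text | The transcribed text as a string |
| SRT | SubRip subtitle format | Subtitles with timestamps |
| VERBOSE_JSON | Detailed JSON | Includes language, duration, and word-level/segment-level timestamps |
| VTT | WebVTT subtitle format | Subtitles with timestamps |

GranularityType enum:

| Value | Description |
|---|---|
| WORD | Word-level timestamps |
| SEGMENT | Segment-level timestamps |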

3.6 Detailed Response Data

When the VERBOSE_JSON format is used, the response carries rich structured data:

AudioTranscriptionResponse response = transcriptionModel.call(prompt);
OpenAiAudioTranscriptionResponseMetadata metadata =
        (OpenAiAudioTranscriptionResponseMetadata) response.getMetadata();

// Rate limit (inherited from the OpenAI parent class)
RateLimit rateLimit = metadata.getRateLimit();
Long requestsLimit = rateLimit.getRequestsLimit();
Long requestsRemaining = rateLimit.getRequestsRemaining();
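
The rate-limit values read above can drive a simple client-side throttle. The following is a minimal sketch, assuming the caller also knows how long it takes for the request quota to reset; that reset duration is an assumed input here, and RateLimitBackoff is a hypothetical helper rather than anything shipped with Spring AI.

```java
import java.time.Duration;

// Hypothetical helper: decides how long to wait before the next request,
// based on the remaining request quota and an assumed reset duration.
final class RateLimitBackoff {

    private RateLimitBackoff() {
    }

    /**
     * Returns Duration.ZERO while quota remains; otherwise returns the
     * reset duration (or ZERO when it is unknown or negative).
     */
    static Duration delayBeforeNextCall(long requestsRemaining, Duration requestsReset) {
        if (requestsRemaining > 0) {
            return Duration.ZERO;
        }
        if (requestsReset == null || requestsReset.isNegative()) {
            return Duration.ZERO;
        }
        return requestsReset;
    }

    public static void main(String[] args) {
        System.out.println(delayBeforeNextCall(0, Duration.ofSeconds(20)));
    }
}
```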

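Internal data structure of StructuredResponse:

| Field | Type | Description |
|---|---|---|
| language | String | Detected language |
| duration | Float | Audio duration in seconds |
| text | String | Full transcribed text |
| words | List<Word> | Word-level information (word, start, end) |
| segments | List<Segment> | Segment-level information (id, seek, start, end, text, tokens, temperature, etc.) |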
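
Segment-level start and end times are enough to build subtitles on the client side. The sketch below converts a list of segments into SubRip (SRT) text. The Segment record here is a hypothetical stand-in that carries only the fields needed for subtitles; it is not the Spring AI type, and SrtWriter is an illustrative name.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper: renders segment timings as SubRip (SRT) cues.
final class SrtWriter {

    // Stand-in for a segment entry; start and end are in seconds.
    record Segment(int id, float start, float end, String text) { }

    private SrtWriter() {
    }

    /** Formats seconds as the SRT timestamp HH:MM:SS,mmm. */
    static String timestamp(float seconds) {
        long millis = Math.round(seconds * 1000.0);
        long h = millis / 3_600_000;
        long m = (millis / 60_000) % 60;
        long s = (millis / 1000) % 60;
        long ms = millis % 1000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", h, m, s, ms);
    }

    /** Emits one numbered cue per segment, separated by blank lines. */
    static String toSrt(List<Segment> segments) {
        StringBuilder sb = new StringBuilder();
        int index = 1; // SRT cue numbers start at 1
        for (Segment seg : segments) {
            sb.append(index++).append('\n')
              .append(timestamp(seg.start())).append(" --> ")
              .append(timestamp(seg.end())).append('\n')
              .append(seg.text().strip()).append("\n\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toSrt(List.of(new Segment(0, 0f, 2.5f, " Hello "))));
    }
}
```

If the provider's own SRT output is acceptable, requesting the SRT response format directly avoids this step; the conversion is mainly useful when the same VERBOSE_JSON response also feeds other processing.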

4. Azure OpenAI Transcription

Azure OpenAI offers transcription with the Whisper model through AzureOpenAiAudioTranscriptionModel.

application.yml
spring:
  ai:
    azure:
      openai:
        api-key: ${AZURE_OPENAI_API_KEY}
        endpoint: ${AZURE_OPENAI_ENDPOINT}
        audio:
          transcription:
            options:
              deployment-name: whisper-1

AzureOpenAiAudioTranscriptionOptions adds the following on top of the OpenAI options:

| Field | Description |
|---|---|
| deploymentName | Azure deployment name (takes precedence over model) |
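
The precedence rule in the table can be pictured as a simple fallback. This is only an illustration of the rule as stated, not the actual Spring AI implementation; DeploymentResolver is a hypothetical name.

```java
// Hypothetical helper: illustrates "deploymentName takes precedence over model".
final class DeploymentResolver {

    private DeploymentResolver() {
    }

    /** Returns deploymentName when it is set, otherwise falls back to model. */
    static String resolve(String deploymentName, String model) {
        if (deploymentName != null && !deploymentName.isBlank()) {
            return deploymentName;
        }
        return model;
    }

    public static void main(String[] args) {
        System.out.println(resolve("my-whisper-deployment", "whisper-1"));
        System.out.println(resolve(null, "whisper-1"));
    }
}
```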

Complete example: AzureTranscriptionDemo.java
@Component
public class AzureTranscriptionDemo implements CommandLineRunner {

    private final AzureOpenAiAudioTranscriptionModel transcriptionModel;

    public AzureTranscriptionDemo(AzureOpenAiAudioTranscriptionModel transcriptionModel) {
        this.transcriptionModel = transcriptionModel;
    }

    @Override
    public void run(String... args) {
        AzureOpenAiAudioTranscriptionOptions options =
                AzureOpenAiAudioTranscriptionOptions.builder()
                        .deploymentName("whisper-1")
                        .language("zh")
                        .temperature(0.2f)
                        .responseFormat(TranscriptResponseFormat.TEXT)
                        .build();

        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(
                new ClassPathResource("audio/lecture.mp3"),
                options
        );

        String text = transcriptionModel.call(prompt).getResult().getOutput();
        System.out.println("Transcription: " + text);
    }
}

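5. Text-to-Speech

5.1 Unified TTS Interface

Spring AI 1.1.0 adds the org.springframework.ai.audio.tts package to spring-ai-model, providing a unified TTS abstraction across providers.

| Interface/Class | Description |
|---|---|
| TextToSpeechModel | Synchronous TTS model: call(TextToSpeechPrompt) → TextToSpeechResponse |
| StreamingTextToSpeechModel | Streaming TTS model: stream(TextToSpeechPrompt) → Flux<TextToSpeechResponse> |
| TextToSpeechPrompt | TTS request; holds a list of TextToSpeechMessage and the TextToSpeechOptions |
| TextToSpeechResponse | TTS response; holds the Speech object and metadata |
| Speech | Speech result; wraps the audio data (byte[] / Resource) and the format |
| TextToSpeechOptions | Options interface; supports model, voice, speed, responseFormat |
| DefaultTextToSpeechOptions | Default options implementation, built with a builder |

Breaking change: the SpeechModel, SpeechPrompt, and SpeechMessage types that previously lived in the OpenAI module have been removed; migrate to the unified TextToSpeechModel interface.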

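5.2 Streaming Speech Synthesis

The StreamingTextToSpeechModel interface returns a reactive Flux<TextToSpeechResponse> in which each element carries one chunk of audio, which suits real-time playback.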

5.3 Core Data Structures

TextToSpeechPrompt wraps the text message and the options:

TextToSpeechPrompt prompt = new TextToSpeechPrompt("Welcome to Spring AI speech synthesis");

OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
        .model("tts-1-hd")
        .voice("nova")
        .speed(1.0)  // speed is a Double in the unified TTS API
        .responseFormat(AudioResponseFormat.MP3)
        .build();

TextToSpeechPrompt promptWithOptions = new TextToSpeechPrompt(
        new TextToSpeechMessage("Hello, I am the Spring AI voice assistant"),
        options);

TextToSpeechMessage wraps the text to synthesize:

TextToSpeechMessage message = new TextToSpeechMessage("Text content to synthesize");
String text = message.getText();

TextToSpeechResponse returns the synthesized audio:

TextToSpeechResponse response = speechModel.call(prompt);

Speech speech = response.getResult();
byte[] audioBytes = speech.getOutput();

TextToSpeechResponseMetadata metadata = response.getMetadata();

5.4 Basic Synthesis

Complete example: SpeechDemo.java
@Component
public class SpeechDemo implements CommandLineRunner {

    private final TextToSpeechModel speechModel;

    public SpeechDemo(TextToSpeechModel speechModel) {
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) throws IOException {
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(
                "Spring AI lets Java developers plug in AI capabilities with ease");
        Speech speech = speechModel.call(prompt).getResult();
        byte[] audio = speech.getOutput();

        // Write the synthesized audio to disk
        Path output = Path.of("output.mp3");
        Files.write(output, audio);
        System.out.println("Audio saved to: " + output.toAbsolutePath()
                + " (" + audio.length + " bytes)");
    }
}

5.5 Synthesis with Options

Complete example: SpeechOptionsDemo.java
import org.springframework.ai.openai.api.OpenAiAudioApi.SpeechRequest.AudioResponseFormat;

@Component
public class SpeechOptionsDemo implements CommandLineRunner {

    private final TextToSpeechModel speechModel;

    public SpeechOptionsDemo(TextToSpeechModel speechModel) {
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) {
        OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
                .model("tts-1-hd")
                .voice("nova")
                .speed(1.1)  // Double, slightly faster than normal
                .responseFormat(AudioResponseFormat.MP3)
                .build();

        TextToSpeechPrompt prompt = new TextToSpeechPrompt(
                new TextToSpeechMessage("Spring AI is a powerful AI integration framework"),
                options
        );

        TextToSpeechResponse response = speechModel.call(prompt);

        byte[] audio = response.getResult().getOutput();

        System.out.println("Audio size: " + audio.length + " bytes");
    }
}

5.6 Comparing Voices

OpenAI TTS offers 9 voices to suit different scenarios.

Complete example: VoiceComparisonDemo.java
@Component
public class VoiceComparisonDemo implements CommandLineRunner {

    private final TextToSpeechModel speechModel;

    public VoiceComparisonDemo(TextToSpeechModel speechModel) {
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) throws IOException {
        String[] voices = {"alloy", "echo", "fable", "nova", "onyx",
                "shimmer", "sage", "coral", "ash"};

        // Synthesize the same sentence once per voice and save each result
        for (String voice : voices) {
            TextToSpeechPrompt prompt = new TextToSpeechPrompt(
                    new TextToSpeechMessage("Spring AI is a powerful AI integration framework"),
                    OpenAiAudioSpeechOptions.builder()
                            .model("tts-1")
                            .voice(voice)
                            .speed(1.0)
                            .responseFormat(AudioResponseFormat.MP3)
                            .build()
            );

            byte[] audio = speechModel.call(prompt).getResult().getOutput();
            Files.write(Path.of("speech-" + voice + ".mp3"), audio);
            System.out.println(voice + " → " + audio.length + " bytes");
        }
    }
}

5.7 Streaming Synthesis

Complete example: StreamingSpeechDemo.java
@Component
public class StreamingSpeechDemo implements CommandLineRunner {

    private final StreamingTextToSpeechModel speechModel;

    public StreamingSpeechDemo(StreamingTextToSpeechModel speechModel) {
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) throws IOException {
        OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
                .model("tts-1")
                .voice("nova")
                .speed(1.0)
                .responseFormat(AudioResponseFormat.MP3)
                .build();

        TextToSpeechPrompt prompt = new TextToSpeechPrompt(
                new TextToSpeechMessage("Streaming speech synthesis lets audio play while it is "
                        + "still being generated, which suits low-latency scenarios"),
                options
        );

        Path output = Path.of("streaming-output.mp3");
        try (FileOutputStream fos = new FileOutputStream(output.toFile())) {
            speechModel.stream(prompt)
                    // Each streamed response carries one audio chunk; append it to the file
                    .doOnNext(response -> {
                        try { fos.write(response.getResult().getOutput()); }
                        catch (IOException e) { throw new UncheckedIOException(e); }
                    })
                    .blockLast();
        }

        System.out.println("Streamed audio saved: " + output.toAbsolutePath());
    }
}

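6. Speech Synthesis Options

| Method | Default | Description |
|---|---|---|
| model(String) | tts-1 | tts-1 (standard) / tts-1-hd (high definition) |
| input(String) | - | Text to synthesize, at most 4096 characters |
| voice(String) | alloy | Voice selection, 9 available |
| responseFormat(AudioResponseFormat) | MP3 | Output audio format, 6 available |
| speed(Double) | 1.0 | Speaking speed, 0.25-4.0 |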
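
The 4096-character input limit and the 0.25-4.0 speed range in the table are easy to enforce before a request is sent. The following is a minimal sketch, with SpeechInputChunker as a hypothetical helper name: it splits long text into pieces no longer than the limit, preferring to cut after sentence-ending punctuation, and checks a speed value against the documented range. It counts Java chars, so a hard cut with no punctuation in range could in principle split a surrogate pair.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps TTS input within the limits listed in the options table.
final class SpeechInputChunker {

    static final int MAX_INPUT_CHARS = 4096; // limit stated in the options table

    private SpeechInputChunker() {
    }

    /** Splits text into chunks of at most maxChars, cutting after sentence punctuation when possible. */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Search backwards inside the window for a sentence boundary.
                for (int i = end - 1; i > start; i--) {
                    if (isSentenceEnd(text.charAt(i))) {
                        end = i + 1;
                        break;
                    }
                }
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }

    // ASCII punctuation plus the CJK full stop and full-width '!' and '?'.
    private static boolean isSentenceEnd(char c) {
        return c == '.' || c == '!' || c == '?'
                || c == '\u3002' || c == '\uFF01' || c == '\uFF1F';
    }

    /** Returns true if speed lies in the documented 0.25-4.0 range. */
    static boolean isValidSpeed(double speed) {
        return speed >= 0.25 && speed <= 4.0;
    }

    public static void main(String[] args) {
        System.out.println(split("One. Two. Three.", 10));
    }
}
```

Each chunk can then be sent as its own TextToSpeechPrompt and the resulting audio written out in order.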

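TtsModel enum

| Value | Description |
|---|---|
| TTS_1("tts-1") | Standard quality, lower latency |
| TTS_1_HD("tts-1-hd") | High-definition quality, closer to a human voice |

Voice enum (9 voices)

| Value | Gender | Style |
|---|---|---|
| ALLOY | Neutral | Gentle and natural |
| ECHO | Male | Deep and steady |
| FABLE | British male | Strong narrative feel |
| ONYX | Male | Deep and powerful |
| NOVA | Female | Soft and warm |
| SHIMMER | Female | Clear and bright |
| SAGE | Male | Mature and steady |
| CORAL | Female | Friendly and natural |
| ASH | Male | Even and calm |

AudioResponseFormat enum (6 formats)

| Value | Description | Typical use |
|---|---|---|
| MP3 | General-purpose compressed format | General playback, best compatibility |
| OPUS | Low-latency compression | Streaming, real-time communication |
| AAC | Digital audio compression | Apple ecosystem |
| FLAC | Lossless compression | High-fidelity storage |
| WAV | Uncompressed PCM | Audio processing, mixing |
| PCM | Raw pulse-code modulation | Low-level audio processing |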
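
PCM output has no header, so most players cannot open it directly. The sketch below prepends a standard 44-byte RIFF/WAVE header to raw samples. PcmToWav is a hypothetical helper; the sample rate, channel count, and bit depth are parameters because they depend on the provider. The usage values of 24000 Hz, mono, 16-bit little-endian are an assumption about OpenAI's PCM output and should be checked against the provider's documentation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: wraps raw PCM samples in a minimal WAV container.
final class PcmToWav {

    private static final int HEADER_SIZE = 44;

    private PcmToWav() {
    }

    /** Returns a new array holding a 44-byte RIFF/WAVE header followed by the PCM data. */
    static byte[] wrap(byte[] pcm, int sampleRate, int channels, int bitsPerSample) {
        int blockAlign = channels * bitsPerSample / 8; // bytes per sample frame
        int byteRate = sampleRate * blockAlign;        // bytes per second

        ByteBuffer header = ByteBuffer.allocate(HEADER_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        header.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        header.putInt(36 + pcm.length);                // size of everything after this field
        header.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        header.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        header.putInt(16);                             // fmt chunk size for linear PCM
        header.putShort((short) 1);                    // audio format 1 = linear PCM
        header.putShort((short) channels);
        header.putInt(sampleRate);
        header.putInt(byteRate);
        header.putShort((short) blockAlign);
        header.putShort((short) bitsPerSample);
        header.put("data".getBytes(StandardCharsets.US_ASCII));
        header.putInt(pcm.length);                     // size of the sample data

        byte[] wav = new byte[HEADER_SIZE + pcm.length];
        System.arraycopy(header.array(), 0, wav, 0, HEADER_SIZE);
        System.arraycopy(pcm, 0, wav, HEADER_SIZE, pcm.length);
        return wav;
    }

    public static void main(String[] args) {
        // Assumed format: 24 kHz, mono, 16-bit samples.
        byte[] wav = wrap(new byte[480], 24000, 1, 16);
        System.out.println(wav.length + " bytes");
    }
}
```

Requesting the WAV format directly is simpler when no further processing is needed; wrapping is mainly useful after the raw samples have been modified.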

7. Configuration Reference

Transcription configuration

| Property | Default | Description |
|---|---|---|
| spring.ai.openai.audio.transcription.options.model | whisper-1 | Transcription model |
| spring.ai.openai.audio.transcription.options.temperature | 0.7 | Sampling temperature |
| spring.ai.openai.audio.transcription.options.response-format | - | json / text / srt / verbose_json / vtt |

Speech synthesis configuration

| Property | Default | Description |
|---|---|---|
| spring.ai.openai.audio.speech.options.model | tts-1 | tts-1 / tts-1-hd |
| spring.ai.openai.audio.speech.options.voice | alloy | 9 voices |
| spring.ai.openai.audio.speech.options.speed | 1.0 | 0.25-4.0 |
| spring.ai.openai.audio.speech.options.response-format | mp3 | mp3 / opus / aac / flac / wav / pcm |

8. Complete End-to-End Example

Complete example: AudioCompleteExample.java
@Component
public class AudioCompleteExample implements CommandLineRunner {

    private final OpenAiAudioTranscriptionModel transcriptionModel;
    private final TextToSpeechModel speechModel;

    public AudioCompleteExample(
            OpenAiAudioTranscriptionModel transcriptionModel,
            TextToSpeechModel speechModel) {
        this.transcriptionModel = transcriptionModel;
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) throws IOException {
        // 1. Transcribe the audio
        String transcribedText = transcriptionModel.call(
                new ClassPathResource("audio/lecture.mp3"));
        System.out.println("Transcription: " + transcribedText);

        // 2. Produce a summary (in a real application this is delegated to a ChatModel)
        String summary = "Key point of this lecture: Spring AI unifies how different AI providers are called";

        // 3. Synthesize the summary as speech
        TextToSpeechPrompt speechPrompt = new TextToSpeechPrompt(
                new TextToSpeechMessage(summary),
                OpenAiAudioSpeechOptions.builder()
                        .model("tts-1-hd")
                        .voice("nova")
                        .speed(1.1)
                        .responseFormat(AudioResponseFormat.MP3)
                        .build()
        );

        byte[] audio = speechModel.call(speechPrompt).getResult().getOutput();
        Files.write(Path.of("summary.mp3"), audio);

        System.out.println("Summary audio generated: " + audio.length + " bytes");
    }
}