Version: 1.1.3

Audio

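Spring AI provides two audio capabilities: speech-to-text (transcription) and text-to-speech (TTS). Transcription is based on the OpenAI Whisper model and is implemented for two providers, OpenAI and Azure OpenAI. Speech synthesis supports OpenAI and ElevenLabs, both called through the unified TextToSpeechModel interface.

1. Overview

Spring AI exposes two sets of APIs: AudioTranscriptionModel for speech-to-text, and the unified TextToSpeechModel / StreamingTextToSpeechModel for text-to-speech. Both give different providers the same calling convention.

2. Speech-to-Text

2.1 Transcription request

AudioTranscriptionPrompt wraps the audio resource (Resource) and the transcription options, and implements ModelRequest<Resource>. Supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, webm.

// Simplest form: audio file only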
AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(
        new ClassPathResource("audio/interview.mp3"));

// Construct with options
AudioTranscriptionPrompt promptWithOptions = new AudioTranscriptionPrompt(
        new ClassPathResource("audio/meeting.mp3"),
        OpenAiAudioTranscriptionOptions.builder()
                .model("whisper-1")
                .language("zh")
                .temperature(0.2f)
                .responseFormat(TranscriptResponseFormat.VERBOSE_JSON)
                .build());
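
The supported formats listed in this section can be checked before a file is uploaded. The following is a minimal sketch of such a pre-flight check. The class AudioFormatCheck and its method isSupported are hypothetical helpers written for illustration, not part of Spring AI, and the check looks only at the file extension, not at the actual audio content.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-flight check; not a Spring AI class. */
final class AudioFormatCheck {

    // Extensions copied from the supported-format list in section 2.1.
    private static final Set<String> SUPPORTED =
            Set.of("mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm");

    private AudioFormatCheck() {
    }

    /** Returns true when the file name ends in one of the listed extensions. */
    static boolean isSupported(String filename) {
        if (filename == null) {
            return false;
        }
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return false; // no extension present
        }
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(ext);
    }
}
```

A controller could call AudioFormatCheck.isSupported(file.getOriginalFilename()) and reject the request early instead of waiting for the provider to return an error.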

2.2 Transcription response

AudioTranscriptionResponse response = model.call(prompt);

// Transcribed text
AudioTranscription result = response.getResult();
String text = result.getOutput();

// Response metadata and the full result list
AudioTranscriptionResponseMetadata metadata = response.getMetadata();
List<AudioTranscription> allResults = response.getResults();

2.3 Transcription options

The portable options interface defines only the model selection.

public interface AudioTranscriptionOptions extends ModelOptions {
    String getModel();
}

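Each provider extends this interface with its own options.

3. OpenAI transcription implementation

3.1 Using the transcription model

OpenAiAudioTranscriptionModel implements Model<AudioTranscriptionPrompt, AudioTranscriptionResponse> and also provides a convenience call(Resource) overload, so there is no need to build an AudioTranscriptionPrompt by hand.

3.2 Auto-injection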

application.yml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        transcription:
          options:
            model: whisper-1
            temperature: 0.7

@RestController
public class TranscriptionController {

    private final OpenAiAudioTranscriptionModel transcriptionModel;

    public TranscriptionController(OpenAiAudioTranscriptionModel transcriptionModel) {
        this.transcriptionModel = transcriptionModel;
    }

    @PostMapping("/transcribe")
    public String transcribe(@RequestParam("file") MultipartFile file) throws IOException {
        Resource audioResource = new InputStreamResource(file.getInputStream());
        return transcriptionModel.call(audioResource);
    }
}

3.3 Basic transcription

The call(Resource) convenience method completes a transcription in a single line.

Complete example: TranscriptionDemo.java
@Component
public class TranscriptionDemo implements CommandLineRunner {

    private final OpenAiAudioTranscriptionModel transcriptionModel;

    public TranscriptionDemo(OpenAiAudioTranscriptionModel transcriptionModel) {
        this.transcriptionModel = transcriptionModel;
    }

    @Override
    public void run(String... args) {
        Resource audio = new ClassPathResource("audio/meeting.mp3");
        String text = transcriptionModel.call(audio);
        System.out.println("Transcription: " + text);
    }
}

3.4 Transcription with parameters

Complete example: TranscriptionOptionsDemo.java
@Component
public class TranscriptionOptionsDemo implements CommandLineRunner {

    private final OpenAiAudioTranscriptionModel transcriptionModel;

    public TranscriptionOptionsDemo(OpenAiAudioTranscriptionModel transcriptionModel) {
        this.transcriptionModel = transcriptionModel;
    }

    @Override
    public void run(String... args) {
        OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                .model("whisper-1")
                .language("zh")
                .prompt("这是一段技术会议的录音")
                .temperature(0.2f)
                .responseFormat(TranscriptResponseFormat.VERBOSE_JSON)
                .granularityType(GranularityType.SEGMENT)
                .build();

        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(
                new ClassPathResource("audio/meeting.mp3"),
                options
        );

        AudioTranscriptionResponse response = transcriptionModel.call(prompt);
        System.out.println("Transcription: " + response.getResult().getOutput());
    }
}

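3.5 Transcription option parameters

Method	Default	Description
`model(String)`	`whisper-1`	Model name
`language(String)`	-	ISO-639-1 language code, such as `zh` or `en`
`prompt(String)`	-	Guiding prompt that helps the model with context and terminology
`temperature(Float)`	`0.7`	Sampling temperature, 0-1; lower values are more deterministic
`responseFormat(TranscriptResponseFormat)`	`JSON`	Response format, see the table below
`granularityType(GranularityType)`	-	Timestamp granularity (only takes effect with the `VERBOSE_JSON` format)

TranscriptResponseFormat enum:

Value	Description	Response structure
`JSON`	JSON format, text only	`{"text": "..."}`
`TEXT`	Plain text	Transcribed text string
`SRT`	SubRip subtitle format	Subtitles with timestamps
`VERBOSE_JSON`	Detailed JSON	Includes language, duration, and word-level/segment-level timestamps
`VTT`	WebVTT subtitle format	Subtitles with timestamps

GranularityType enum:

Value	Description
`WORD`	Word-level timestamps
`SEGMENT`	Segment-level timestamps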

3.6 Detailed response data

When the VERBOSE_JSON format is used, the response contains rich structured data:

AudioTranscriptionResponse response = transcriptionModel.call(prompt);
OpenAiAudioTranscriptionResponseMetadata metadata =
        (OpenAiAudioTranscriptionResponseMetadata) response.getMetadata();

// Rate limits (inherited from the OpenAI parent class)
RateLimit rateLimit = metadata.getRateLimit();
rateLimit.getRequestsLimit();
rateLimit.getRequestsRemaining();

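Internal data structure of StructuredResponse:

Field	Type	Description
`language`	`String`	Detected language
`duration`	`Float`	Audio duration (seconds)
`text`	`String`	Full transcribed text
`words`	`List<Word>`	Word-level information (word, start, end)
`segments`	`List<Segment>`	Segment-level information (id, seek, start, end, text, tokens, temperature, and so on)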
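
As a rough illustration of what segment-level data makes possible, the sketch below renders a list of segments as SRT subtitle cues. The Segment record and the SrtFormatter class are hypothetical stand-ins defined here, not Spring AI types. Only the start, end, and text fields from the table above are assumed, with start and end expressed in seconds.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper; not a Spring AI class. */
final class SrtFormatter {

    /** Stand-in for one segment entry: start/end in seconds plus the text. */
    record Segment(double start, double end, String text) {
    }

    private SrtFormatter() {
    }

    /** Formats a time in seconds as the SRT timestamp HH:MM:SS,mmm. */
    static String timestamp(double seconds) {
        long totalMillis = Math.round(seconds * 1000.0);
        long hours = totalMillis / 3_600_000;
        long minutes = (totalMillis / 60_000) % 60;
        long secs = (totalMillis / 1000) % 60;
        long millis = totalMillis % 1000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d",
                hours, minutes, secs, millis);
    }

    /** Renders numbered SRT cues, one per segment, each followed by a blank line. */
    static String toSrt(List<Segment> segments) {
        StringBuilder sb = new StringBuilder();
        int index = 1;
        for (Segment s : segments) {
            sb.append(index++).append('\n')
              .append(timestamp(s.start())).append(" --> ")
              .append(timestamp(s.end())).append('\n')
              .append(s.text().strip()).append("\n\n");
        }
        return sb.toString();
    }
}
```

When only subtitles are needed, requesting the SRT or VTT response format directly is simpler. A conversion like this is mainly useful when the same VERBOSE_JSON response also feeds other processing.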

4. Azure OpenAI transcription

Azure OpenAI offers transcription with the Whisper model through AzureOpenAiAudioTranscriptionModel.

application.yml
spring:
  ai:
    azure:
      openai:
        api-key: ${AZURE_OPENAI_API_KEY}
        endpoint: ${AZURE_OPENAI_ENDPOINT}
        audio:
          transcription:
            options:
              deployment-name: whisper-1

AzureOpenAiAudioTranscriptionOptions adds the following on top of the OpenAI options:

Field	Description
`deploymentName`	Azure deployment name (takes precedence over model)
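
The precedence rule in the table above can be pictured as a simple fallback. The sketch below is only an illustration of that rule: DeploymentNameResolver and resolve are hypothetical names, and the code is not taken from the Spring AI implementation.

```java
/** Hypothetical illustration of "deploymentName takes precedence over model". */
final class DeploymentNameResolver {

    private DeploymentNameResolver() {
    }

    /** Returns deploymentName when it is set and non-blank, otherwise falls back to model. */
    static String resolve(String deploymentName, String model) {
        if (deploymentName != null && !deploymentName.isBlank()) {
            return deploymentName;
        }
        return model;
    }
}
```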

Complete example: AzureTranscriptionDemo.java
@Component
public class AzureTranscriptionDemo implements CommandLineRunner {

    private final AzureOpenAiAudioTranscriptionModel transcriptionModel;

    public AzureTranscriptionDemo(AzureOpenAiAudioTranscriptionModel transcriptionModel) {
        this.transcriptionModel = transcriptionModel;
    }

    @Override
    public void run(String... args) {
        AzureOpenAiAudioTranscriptionOptions options =
                AzureOpenAiAudioTranscriptionOptions.builder()
                        .deploymentName("whisper-1")
                        .language("zh")
                        .temperature(0.2f)
                        .responseFormat(TranscriptResponseFormat.TEXT)
                        .build();

        AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(
                new ClassPathResource("audio/lecture.mp3"),
                options
        );

        String text = transcriptionModel.call(prompt).getResult().getOutput();
        System.out.println("Transcription: " + text);
    }
}

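5. Text-to-Speech

5.1 Unified TTS interface

Spring AI 1.1.0 adds the org.springframework.ai.audio.tts package to spring-ai-model, providing a unified, cross-provider TTS abstraction.

Interface/Class	Description
`TextToSpeechModel`	Synchronous TTS model, `call(TextToSpeechPrompt)` → `TextToSpeechResponse`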
`StreamingTextToSpeechModel`	Streaming TTS model, `stream(TextToSpeechPrompt)` → `Flux<TextToSpeechResponse>`
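`TextToSpeechPrompt`	TTS request, containing a list of `TextToSpeechMessage` and the `TextToSpeechOptions`
`TextToSpeechResponse`	TTS response, containing the `Speech` object and metadata
`Speech`	Speech result, wrapping the audio data (`byte[]` / `Resource`) and its format
`TextToSpeechOptions`	Options interface, supporting `model`, `voice`, `speed`, and `responseFormat`
`DefaultTextToSpeechOptions`	Default options implementation, built with the Builder pattern

Breaking change: the OpenAI module's former SpeechModel, SpeechPrompt, and SpeechMessage have been removed in favor of the unified TextToSpeechModel interface.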

5.2 Streaming speech synthesis

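The StreamingTextToSpeechModel interface returns a Flux<TextToSpeechResponse> in which each element carries one chunk of audio, which suits real-time playback.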

5.3 Core data structures

TextToSpeechPrompt wraps the text message and the options:

TextToSpeechPrompt prompt = new TextToSpeechPrompt("Welcome to Spring AI speech synthesis");

OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
        .model("tts-1-hd")
        .voice("nova")
        .speed(1.0)
        .responseFormat(AudioResponseFormat.MP3)
        .build();

TextToSpeechPrompt promptWithOptions = new TextToSpeechPrompt(
        new TextToSpeechMessage("Hello, I am the Spring AI voice assistant"),
        options);

TextToSpeechMessage wraps the text to synthesize:

TextToSpeechMessage message = new TextToSpeechMessage("Text content to synthesize");
String text = message.getText();

TextToSpeechResponse returns the synthesized audio:

TextToSpeechResponse response = speechModel.call(prompt);

Speech speech = response.getResult();
byte[] audioBytes = speech.getOutput();

TextToSpeechResponseMetadata metadata = response.getMetadata();

5.4 Basic synthesis

Complete example: SpeechDemo.java
@Component
public class SpeechDemo implements CommandLineRunner {

    private final TextToSpeechModel speechModel;

    public SpeechDemo(TextToSpeechModel speechModel) {
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) throws IOException {
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(
                "Spring AI makes it easy for Java developers to add AI capabilities");
        Speech speech = speechModel.call(prompt).getResult();
        byte[] audio = speech.getOutput();

        Path output = Path.of("output.mp3");
        Files.write(output, audio);
        System.out.println("Audio saved to: " + output.toAbsolutePath()
                + " (" + audio.length + " bytes)");
    }
}

5.5 Synthesis with parameters

Complete example: SpeechOptionsDemo.java
@Component
public class SpeechOptionsDemo implements CommandLineRunner {

    private final TextToSpeechModel speechModel;

    public SpeechOptionsDemo(TextToSpeechModel speechModel) {
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) {
        OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
                .model("tts-1-hd")
                .voice("nova")
                .speed(1.1)
                .responseFormat(AudioResponseFormat.MP3)
                .build();

        TextToSpeechPrompt prompt = new TextToSpeechPrompt(
                new TextToSpeechMessage("Spring AI is a powerful AI integration framework"),
                options
        );

        TextToSpeechResponse response = speechModel.call(prompt);

        byte[] audio = response.getResult().getOutput();

        System.out.println("Audio size: " + audio.length + " bytes");
    }
}

5.6 Comparing voices

OpenAI TTS offers 9 voices to suit different scenarios.

Complete example: VoiceComparisonDemo.java
@Component
public class VoiceComparisonDemo implements CommandLineRunner {

    private final TextToSpeechModel speechModel;

    public VoiceComparisonDemo(TextToSpeechModel speechModel) {
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) throws IOException {
        String[] voices = {"alloy", "echo", "fable", "nova", "onyx",
                "shimmer", "sage", "coral", "ash"};

        for (String voice : voices) {
            TextToSpeechPrompt prompt = new TextToSpeechPrompt(
                    new TextToSpeechMessage("Spring AI is a powerful AI integration framework"),
                    OpenAiAudioSpeechOptions.builder()
                            .model("tts-1")
                            .voice(voice)
                            .speed(1.0)
                            .responseFormat(AudioResponseFormat.MP3)
                            .build()
            );

            byte[] audio = speechModel.call(prompt).getResult().getOutput();
            Files.write(Path.of("speech-" + voice + ".mp3"), audio);
            System.out.println(voice + " → " + audio.length + " bytes");
        }
    }
}

5.7 Streaming synthesis

Complete example: StreamingSpeechDemo.java
@Component
public class StreamingSpeechDemo implements CommandLineRunner {

    private final StreamingTextToSpeechModel speechModel;

    public StreamingSpeechDemo(StreamingTextToSpeechModel speechModel) {
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) throws IOException {
        OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
                .model("tts-1")
                .voice("nova")
                .speed(1.0)
                .responseFormat(AudioResponseFormat.MP3)
                .build();

        TextToSpeechPrompt prompt = new TextToSpeechPrompt(
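                new TextToSpeechMessage("Streaming speech synthesis lets audio play while it is still being generated, "
                        + "which suits scenarios that need low-latency responses"),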
                options
        );

        Path output = Path.of("streaming-output.mp3");
        try (FileOutputStream fos = new FileOutputStream(output.toFile())) {
            speechModel.stream(prompt)
                    .doOnNext(response -> {
                        // Each emitted TextToSpeechResponse carries one audio chunk
                        try { fos.write(response.getResult().getOutput()); }
                        catch (IOException e) { throw new RuntimeException(e); }
                    })
                    .blockLast();
        }

        System.out.println("Streamed audio saved: " + output.toAbsolutePath());
    }
}

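6. Speech synthesis options

Method	Default	Description
`model(String)`	`tts-1`	tts-1 (standard) / tts-1-hd (high definition)
`input(String)`	-	Text to synthesize, at most 4096 characters
`voice(String)`	`alloy`	Voice selection, 9 available
`responseFormat(AudioResponseFormat)`	`MP3`	Output audio format, 6 available
`speed(Double)`	`1.0`	Speaking speed, 0.25-4.0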
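
The limits in the table above (input of at most 4096 characters, speed between 0.25 and 4.0) can be enforced on the client before a request is sent. The sketch below is one way to do that. TtsInputGuard and its methods are hypothetical helpers, not Spring AI APIs, and the code assumes the character limit can be approximated by Java's String.length().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side guard for TTS limits; not a Spring AI class. */
final class TtsInputGuard {

    static final int MAX_INPUT_CHARS = 4096; // limit taken from the options table
    static final double MIN_SPEED = 0.25;
    static final double MAX_SPEED = 4.0;

    private TtsInputGuard() {
    }

    /** Clamps a requested speed into the documented 0.25-4.0 range. */
    static double clampSpeed(double speed) {
        return Math.max(MIN_SPEED, Math.min(MAX_SPEED, speed));
    }

    /** Splits text into chunks of at most maxChars, cutting after sentence punctuation when possible. */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Scan backwards inside the window for the last sentence boundary.
                for (int i = end - 1; i > start; i--) {
                    char c = text.charAt(i);
                    if (c == '.' || c == '!' || c == '?'
                            || c == '\u3002' || c == '\uFF01' || c == '\uFF1F') {
                        end = i + 1;
                        break;
                    }
                }
            }
            chunks.add(text.substring(start, end)); // falls back to a hard cut if no boundary was found
            start = end;
        }
        return chunks;
    }
}
```

Each chunk returned by TtsInputGuard.split(text, TtsInputGuard.MAX_INPUT_CHARS) could then be sent as its own TextToSpeechPrompt and the resulting audio concatenated. This simple version may split a surrogate pair when it has to make a hard cut.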

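TtsModel enum

Value	Description
`TTS_1("tts-1")`	Standard quality, lower latency
`TTS_1_HD("tts-1-hd")`	High-definition quality, closer to a human voice

Voice enum (9 voices)

Enum	Gender	Style
`ALLOY`	Neutral	Gentle and natural
`ECHO`	Male	Deep and steady
`FABLE`	British male	Strong narrative feel
`ONYX`	Male	Deep and powerful
`NOVA`	Female	Soft and warm
`SHIMMER`	Female	Clear and bright
`SAGE`	Male	Mature and steady
`CORAL`	Female	Friendly and natural
`ASH`	Male	Calm and even

AudioResponseFormat enum (6 formats)

Enum	Description	Typical use
`MP3`	General-purpose compressed format	General playback, best compatibility
`OPUS`	Low-latency compression	Streaming, real-time communication
`AAC`	Digital audio compression	Apple ecosystem
`FLAC`	Lossless compression	High-fidelity storage
`WAV`	Uncompressed PCM	Audio processing, mixing
`PCM`	Raw pulse-code samples	Low-level audio processing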
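
Because PCM output is raw samples with no container, most players cannot open it directly. The sketch below shows one way to prepend a standard 44-byte WAV header to raw PCM bytes. PcmToWav is a hypothetical helper, not part of Spring AI. The sample rate, channel count, and bit depth are parameters because this document does not state them; values such as 24000 Hz, mono, 16-bit are assumptions to verify against the provider's documentation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that wraps raw little-endian PCM samples in a WAV container. */
final class PcmToWav {

    private PcmToWav() {
    }

    static byte[] wrap(byte[] pcm, int sampleRate, int channels, int bitsPerSample) {
        int blockAlign = channels * bitsPerSample / 8;  // bytes per sample frame
        int byteRate = sampleRate * blockAlign;         // bytes per second
        ByteBuffer buf = ByteBuffer.allocate(44 + pcm.length)
                .order(ByteOrder.LITTLE_ENDIAN);        // WAV header fields are little-endian
        buf.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(36 + pcm.length);                    // RIFF chunk size = total size - 8
        buf.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        buf.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(16);                                 // fmt chunk size for linear PCM
        buf.putShort((short) 1);                        // audio format 1 = linear PCM
        buf.putShort((short) channels);
        buf.putInt(sampleRate);
        buf.putInt(byteRate);
        buf.putShort((short) blockAlign);
        buf.putShort((short) bitsPerSample);
        buf.put("data".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(pcm.length);                         // data chunk size
        buf.put(pcm);
        return buf.array();
    }
}
```

If a playable file is all that is needed, requesting WAV or MP3 directly avoids this step. Wrapping PCM yourself mainly makes sense when the raw samples are processed first.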

7. Configuration reference

OpenAI

Transcription configuration

Property	Default	Description
`spring.ai.openai.audio.transcription.options.model`	`whisper-1`	Transcription model
`spring.ai.openai.audio.transcription.options.temperature`	`0.7`	Sampling temperature
`spring.ai.openai.audio.transcription.options.response-format`	-	json / text / srt / verbose_json / vtt

Speech synthesis configuration

Property	Default	Description
`spring.ai.openai.audio.speech.options.model`	`tts-1`	tts-1 / tts-1-hd
`spring.ai.openai.audio.speech.options.voice`	`alloy`	9 voices
`spring.ai.openai.audio.speech.options.speed`	`1.0`	0.25-4.0
`spring.ai.openai.audio.speech.options.response-format`	`mp3`	mp3 / opus / aac / flac / wav / pcm

Azure OpenAI

Transcription configuration

Property	Default	Description
`spring.ai.azure.openai.audio.transcription.options.deployment-name`	-	Azure deployment name
`spring.ai.azure.openai.audio.transcription.options.temperature`	-	Sampling temperature

8. Complete end-to-end example

Complete example: AudioCompleteExample.java
@Component
public class AudioCompleteExample implements CommandLineRunner {

    private final OpenAiAudioTranscriptionModel transcriptionModel;
    private final TextToSpeechModel speechModel;

    public AudioCompleteExample(
            OpenAiAudioTranscriptionModel transcriptionModel,
            TextToSpeechModel speechModel) {
        this.transcriptionModel = transcriptionModel;
        this.speechModel = speechModel;
    }

    @Override
    public void run(String... args) throws IOException {
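        // 1. Transcribe the audio
        String transcribedText = transcriptionModel.call(
                new ClassPathResource("audio/lecture.mp3"));
        System.out.println("Transcription: " + transcribedText);

        // 2. Produce a summary (in a real application this is delegated to a ChatModel)
        String summary = "Key point of this lecture: Spring AI unifies how different AI providers are called";

        // 3. Synthesize the summary as speech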
        TextToSpeechPrompt speechPrompt = new TextToSpeechPrompt(
                new TextToSpeechMessage(summary),
                OpenAiAudioSpeechOptions.builder()
                        .model("tts-1-hd")
                        .voice("nova")
                        .speed(1.1)
                        .responseFormat(AudioResponseFormat.MP3)
                        .build()
        );

        byte[] audio = speechModel.call(speechPrompt).getResult().getOutput();
        Files.write(Path.of("summary.mp3"), audio);

        System.out.println("Summary audio generated: " + audio.length + " bytes");
    }
}
