
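# Realtime Conversation Developer Guide

> If you are new to the Realtime API, try the [stepfun-realtime-api](https://www.npmjs.com/package/stepfun-realtime-api) package.

This guide covers the event flows required to use model capabilities (audio generation, text generation, function calling) through the Realtime API, and explains how to reason about the state of a realtime session.
While the model generates a response, the server emits a series of lifecycle events. You can listen for them (for example `response.text.delta`) to give users live feedback as the response is produced.
For the complete list of events, see the [Realtime API event reference](/zh/api-reference/realtime/chat). The sections below list the related client and server events in roughly the order they are emitted.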

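## Realtime speech-to-speech sessions

A realtime session is a stateful interaction between the model and a connected client. Its key components are:

* **Session** object: controls the parameters of the interaction, such as the model in use, the voice used for output, and other configuration.
* **Conversation**: the user input items and model output items generated during the current session.
* **Responses**: audio or text items generated by the model, which are added to the conversation.

**Input audio buffer and WebSocket**

When handling audio over WebSocket, you interact with the **input audio buffer** manually by sending JSON events that contain base64-encoded audio.

Together, these components make up a realtime session. You use client events to update the session state and listen for server events to react to state changes in the session.

If you prefer to establish the connection manually, you can connect over WebSocket directly.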

## Quick start: create a WebSocket voice connection

Connecting over WebSocket requires the following connection details:

| Item | Details |
| ---- | ------- |
| URL | `wss://api.stepfun.com/v1/realtime` |
| Query parameter | `model`: the realtime model to connect to, such as `step-audio-2`, `step-audio-2-mini`, or `step-1o-audio` |
| Request header | `Authorization: Bearer YOUR_API_KEY` |

The examples below open a WebSocket connection to the Realtime API using these details:

### Example 1: using the ws module (Node.js)

```javascript theme={null}
import WebSocket from 'ws'

// Connection URL (replace the model ID with the one you want to use)
const url = 'wss://api.stepfun.com/v1/realtime?model=step-1o-audio'
// Open the WebSocket connection
const ws = new WebSocket(url, {
	headers: {
		Authorization: 'Bearer ' + process.env.STEPFUN_API_KEY, // API key read from an environment variable
	},
})

// Fires once the connection is established
ws.on('open', function open() {
	console.log('Connected to the server.')
})

// Fires for every message received from the server
ws.on('message', function incoming(message) {
	console.log(JSON.parse(message.toString())) // Parse and print the received event
})
```

### Example 2: using the websocket-client library (Python)

```python theme={null}
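# Note: this example depends on the websocket-client library (pip install websocket-client)

import os
import json
import websocket

# Read the StepFun API key from an environment variable (raises KeyError if it is unset)
STEPFUN_API_KEY = os.environ["STEPFUN_API_KEY"]

# Connection URL (replace the model ID with the one you want to use)
url = "wss://api.stepfun.com/v1/realtime?model=step-1o-audio"
# Request headers
headers = [
    "Authorization: Bearer " + STEPFUN_API_KEY,
]

# Callback invoked once the connection is established
def on_open(ws):
    print("Connected to the server.")

# Callback invoked for every message received from the server
def on_message(ws, message):
    data = json.loads(message)  # Parse the JSON message
    print("Received event:", json.dumps(data, indent=2))  # Pretty-print the event

# Create the WebSocket app and register the callbacks
ws = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
)

# Keep the WebSocket connection open
ws.run_forever()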
```

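Once you are connected to the Realtime API over WebSocket, you can call a realtime model for speech-to-speech conversation. This requires you to **send client events** to initiate actions and **listen for server events** to react to what the Realtime API does.

## Session lifecycle events

After a session is started over WebSocket, the server sends a `session.created` event to indicate that the session is ready. On the client, you can update the current session configuration with the `session.update` event. Most session properties can be updated at any time. The exception is the `voice` used for audio output, which can no longer be changed once the model has responded with audio during the session. A realtime session can last at most **30 minutes**.

The following example updates a session with the `session.update` client event.

Update the system instructions used by the model in this session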

```javascript theme={null}
const event = {
	type: 'session.update',
	session: {
		instructions: "Never use the word 'moist' in your responses!",
	},
}

// Send the event over the open WebSocket connection (`ws` from the quick start)
ws.send(JSON.stringify(event))
```

```python theme={null}
event = {
    "type": "session.update",
    "session": {
        "instructions": "Never use the word 'moist' in your responses!"
    }
}
ws.send(json.dumps(event))
```

After the session is updated, the server emits a `session.updated` event containing the new state of the session.

| Related client events | Related server events |
| --------------------- | --------------------- |
| session.update | session.created <br /> session.updated |

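## Text input and output

To generate text with a realtime model, add text input to the current conversation, ask the model to generate a response, and listen for server events to follow the progress of that response. To generate text, the session must be configured with the `text` modality (enabled by default).

Use the `conversation.item.create` client event to create a new text conversation item.

Create a conversation item containing user input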

```javascript theme={null}
const event = {
	type: 'conversation.item.create',
	item: {
		type: 'message',
		role: 'user',
		content: [
			{
				type: 'input_text',
				text: 'Which Prince album sold the most copies?',
			},
		],
	},
}

// Send the event over the open WebSocket connection (`ws` from the quick start)
ws.send(JSON.stringify(event))
```

```python theme={null}
event = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {
                "type": "input_text",
                "text": "Which Prince album sold the most copies?",
            }
        ]
    }
}
ws.send(json.dumps(event))
```

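After adding the user message to the conversation, send a `response.create` event to start a model response. If both audio and text are enabled for the current session, the model returns both audio and text content. To generate text only, specify that when sending the `response.create` client event, as shown below.

Generate a text-only response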

```javascript theme={null}
const event = {
	type: 'response.create',
	response: {
		modalities: ['text'],
	},
}

// Send the event over the open WebSocket connection (`ws` from the quick start)
ws.send(JSON.stringify(event))
```

```python theme={null}
event = {
    "type": "response.create",
    "response": {
        "modalities": [ "text" ]
    }
}
ws.send(json.dumps(event))
```

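When the response has been fully generated, the server emits a `response.done` event. This event contains the complete text generated by the model, as shown below.

Listen for response.done to see the final result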

```javascript theme={null}
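function handleEvent(message) {
	// The ws library passes the raw message (a Buffer) to the handler
	const serverEvent = JSON.parse(message.toString())
	if (serverEvent.type === 'response.done') {
		console.log(serverEvent.response.output[0])
	}
}

// Listen for server messages (WebSocket)
ws.on('message', handleEvent)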
```

```python theme={null}
def on_message(ws, message):
    server_event = json.loads(message)
    # json.loads returns a dict, so fields are read by key
    if server_event["type"] == "response.done":
        print(server_event["response"]["output"][0])
```

While the model generates a response, the server emits a series of lifecycle events. You can listen for them (for example `response.text.delta`) to give users live feedback as the response is produced. The table below lists the client events related to text generation together with the server events they trigger, in roughly the order they are emitted.

| Related client events | Related server events |
| --------------------- | --------------------- |
| conversation.item.create <br /> response.create | conversation.item.created <br /> response.output\_item.added <br /> response.content\_part.added <br /> response.text.delta <br /> response.text.done <br /> response.content\_part.done <br /> response.output\_item.done <br /> response.done <br /> rate\_limits.updated |
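
To make the streaming flow concrete, here is a minimal Python sketch that accumulates text deltas until `response.done` arrives. It assumes that each `response.text.delta` event carries its fragment in a `delta` field; the class name `TextResponseCollector` is hypothetical and not part of any SDK.

```python
import json


class TextResponseCollector:
    """Accumulates streamed text fragments until the response is done."""

    def __init__(self):
        self.parts = []
        self.finished = False

    def handle(self, raw_message):
        """Process one raw server message and return its event type."""
        event = json.loads(raw_message)
        event_type = event.get("type")
        if event_type == "response.text.delta":
            # Assumption: the text fragment is carried in a `delta` field.
            self.parts.append(event.get("delta", ""))
        elif event_type == "response.done":
            self.finished = True
        return event_type

    @property
    def text(self):
        """The text received so far."""
        return "".join(self.parts)
```

Calling `collector.handle(message)` from the `on_message` callback lets the UI render `collector.text` after every delta instead of waiting for the final event.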

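## Audio input and output

One of the most powerful features of the Realtime API is speech-to-speech interaction with the model, with no intermediate text-to-speech or speech-to-text step. This lowers latency for voice interfaces and gives the model more information about the tone and inflection of the spoken input.

### Voice options

A realtime session can be configured to use one of several built-in voices for audio output. Set the model's voice with the `voice` parameter in a `session.update` or `response.create` request. Current voice options include `qingchunshaonv`, `wenrounansheng`, `elegantgentle-female`, `livelybreezy-female`, and others.

> Note that once the model has generated audio in a session, the session's `voice` setting can no longer be changed.

> The `step-audio-2` model supports voice cloning. You can create a custom voice by uploading an audio file and then use its ID in the `voice` parameter of a realtime session. See [Voice cloning](/zh/api-reference/audio/create-voice) for details.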

### Handling audio over WebSocket

Sending and receiving audio over WebSocket takes more work on your side, both to send media from the client and to receive it from the server. The table below describes the flow of events needed to send and receive audio during a WebSocket session.

The events are listed in lifecycle order, although some (such as `delta` events) may occur at the same time.

| Lifecycle stage | Client events | Server events |
| --------------- | ------------- | ------------- |
| Session initialization | session.update | session.created<br />session.updated |
| User audio input | conversation.item.create (send a complete audio message)<br />input\_audio\_buffer.append (stream audio in chunks)<br /> input\_audio\_buffer.commit (used when VAD is disabled)<br /> response.create (used when VAD is disabled) | input\_audio\_buffer.speech\_started <br /> input\_audio\_buffer.speech\_stopped<br />input\_audio\_buffer.committed |
| Server audio output | input\_audio\_buffer.clear (used when VAD is disabled, or when you need to start a round of output proactively) | conversation.item.created <br /> response.created <br /> response.output\_item.added <br /> response.content\_part.added <br /> response.audio.delta <br /> response.audio\_transcript.delta <br /> response.text.delta <br /> response.audio.done <br /> response.audio\_transcript.done <br /> response.text.done <br /> response.content\_part.done <br /> response.output\_item.done <br /> response.done <br /> rate\_limits.updated |

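### Streaming audio input to the server

To stream audio input to the server, use the `input_audio_buffer.append` client event. It requires you to send chunks of **Base64-encoded audio bytes** to the Realtime API over the socket. Each chunk must not exceed 15 MB.

The format of the input chunks can be configured for the whole session or for an individual response.

* Session level: `session.input_audio_format` in `session.update`
* Response level: `response.input_audio_format` in `response.create`

Append audio input bytes to the conversation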

```javascript theme={null}
import fs from 'fs';
import decodeAudio from 'audio-decode';

// Convert a Float32Array of audio samples to a PCM16 ArrayBuffer
function floatTo16BitPCM(float32Array) {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  let offset = 0;
  for (let i = 0; i < float32Array.length; i++, offset += 2) {
    let s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// Convert a Float32Array to base64-encoded PCM16 data
function base64EncodeAudio(float32Array) {
  const arrayBuffer = floatTo16BitPCM(float32Array);
  let binary = '';
  let bytes = new Uint8Array(arrayBuffer);
  const chunkSize = 0x8000; // Process 32 KB at a time
  for (let i = 0; i < bytes.length; i += chunkSize) {
    let chunk = bytes.subarray(i, i + chunkSize);
    binary += String.fromCharCode.apply(null, chunk);
  }
  return btoa(binary);
}

// Fill the audio buffer with the contents of three files, then ask the model to respond
const files = [
  './path/to/sample1.wav',
  './path/to/sample2.wav',
  './path/to/sample3.wav'
];

for (const filename of files) {
  const audioFile = fs.readFileSync(filename);
  const audioBuffer = await decodeAudio(audioFile);
  const channelData = audioBuffer.getChannelData(0);
  const base64Chunk = base64EncodeAudio(channelData);
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: base64Chunk
  }));
}

ws.send(JSON.stringify({type: 'input_audio_buffer.commit'}));
ws.send(JSON.stringify({type: 'response.create'}));
```

```python theme={null}
import base64
import json
import struct
import soundfile as sf
from websocket import create_connection

# ... create a websocket-client connection named ws, for example with create_connection(...) ...

def float_to_16bit_pcm(float32_array):
    clipped = [max(-1.0, min(1.0, x)) for x in float32_array]
    pcm16 = b''.join(struct.pack('<h', int(x * 32767)) for x in clipped)
    return pcm16

def base64_encode_audio(float32_array):
    pcm_bytes = float_to_16bit_pcm(float32_array)
    encoded = base64.b64encode(pcm_bytes).decode('ascii')
    return encoded

files = [
    './path/to/sample1.wav',
    './path/to/sample2.wav',
    './path/to/sample3.wav'
]

for filename in files:
    data, samplerate = sf.read(filename, dtype='float32')
    channel_data = data[:, 0] if data.ndim > 1 else data
    base64_chunk = base64_encode_audio(channel_data)

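    # Send the client event
    event = {
        "type": "input_audio_buffer.append",
        "audio": base64_chunk
    }
    ws.send(json.dumps(event))

# Commit the buffered audio and ask the model to respond (mirrors the JavaScript example)
ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
ws.send(json.dumps({"type": "response.create"}))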
```

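### Sending a complete audio message

You can also create a conversation message that contains a complete audio recording. Use the `conversation.item.create` client event to create a message with `input_audio` content.

Create a conversation item with complete audio input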

```javascript theme={null}
const fullAudio = '<base64-encoded string of audio bytes>'

const event = {
	type: 'conversation.item.create',
	item: {
		type: 'message',
		role: 'user',
		content: [
			{
				type: 'input_audio',
				audio: fullAudio,
			},
		],
	},
}

// Send the event over the open WebSocket connection (`ws` from the quick start)
ws.send(JSON.stringify(event))
```

```python theme={null}
fullAudio = "<base64-encoded string of audio bytes>"

event = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "audio": fullAudio,
            }
        ],
    },
}

ws.send(json.dumps(event))
```

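### Handling audio output over WebSocket

Listen for `response.audio.delta` events, which contain base64-encoded chunks of audio data from the model. You need to buffer these chunks and then write them to a file or play them back.

Note that the `response.audio.done` and `response.done` events do not contain audio data, only the transcript of the audio content. To get the actual bytes, you must listen for `response.audio.delta` events.

The format of the output chunks can be configured for the whole session or for an individual response.

* Session level: `session.output_audio_format` in `session.update`
* Response level: `response.output_audio_format` in `response.create`

Listen for response.audio.delta events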

```javascript theme={null}
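function handleEvent(message) {
	// The ws library passes the raw message (a Buffer) to the handler
	const serverEvent = JSON.parse(message.toString())
	if (serverEvent.type === 'response.audio.delta') {
		// Access the base64-encoded audio chunk
		// console.log(serverEvent.delta);
	}
}

// Listen for server messages (WebSocket)
ws.on('message', handleEvent)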
```

```python theme={null}
def on_message(ws, message):
    server_event = json.loads(message)
    # json.loads returns a dict, so fields are read by key
    if server_event["type"] == "response.audio.delta":
        # Access the base64-encoded audio chunk
        audio_chunk = server_event["delta"]
```
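
The buffering step can be sketched as follows in Python. This is an illustration under stated assumptions, not a definitive implementation: it assumes the output format is 16-bit mono PCM at 24000 Hz (the values used by the browser player example later in this guide), and the class name `AudioResponseBuffer` is hypothetical.

```python
import base64
import json
import wave


class AudioResponseBuffer:
    """Collects base64 audio deltas and writes them out as a WAV file."""

    def __init__(self, sample_rate=24000, channels=1):
        # Assumption: PCM16, mono, 24000 Hz. Adjust to your configured output format.
        self.sample_rate = sample_rate
        self.channels = channels
        self.pcm = bytearray()

    def handle(self, raw_message):
        """Append the decoded audio of a `response.audio.delta` event."""
        event = json.loads(raw_message)
        if event.get("type") == "response.audio.delta":
            self.pcm.extend(base64.b64decode(event["delta"]))

    def write_wav(self, target):
        """Write the buffered PCM to `target` (a file path or binary file object)."""
        with wave.open(target, "wb") as wav_file:
            wav_file.setnchannels(self.channels)
            wav_file.setsampwidth(2)  # 16-bit samples
            wav_file.setframerate(self.sample_rate)
            wav_file.writeframes(bytes(self.pcm))
```

A typical use is to call `buffer.handle(message)` for every server message and `buffer.write_wav("reply.wav")` once `response.done` has been received.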

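## Voice activity detection

By default, realtime sessions have **voice activity detection (VAD)** enabled, meaning the API determines when the user starts or stops speaking and responds automatically.

### Disabling VAD

You can disable VAD by setting `turn_detection` to `null` with the `session.update` client event. This is useful for interfaces that need fine-grained control over audio input, such as push-to-talk.

With VAD disabled, the client must send a few extra client events manually to trigger an audio response:

* Send `input_audio_buffer.commit` manually, which creates a new user input item in the conversation.
* Send `response.create` manually to trigger the model's audio response.
* Send `input_audio_buffer.clear` before starting a new user input.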
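
The manual flow above can be sketched in Python as a pair of helpers that build the events for one push-to-talk turn. The function names are hypothetical, and the sketch assumes each chunk is raw PCM16 audio already in the session's input format.

```python
import base64
import json


def disable_vad_event():
    """Build the session.update event that turns VAD off (null turn_detection)."""
    return {"type": "session.update", "session": {"turn_detection": None}}


def push_to_talk_events(pcm_chunks):
    """Build the ordered client events for one manually controlled user turn."""
    # Clear leftovers from a previous turn before new input starts.
    events = [{"type": "input_audio_buffer.clear"}]
    for chunk in pcm_chunks:
        events.append({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    # Commit creates the user input item; response.create triggers the reply.
    events.append({"type": "input_audio_buffer.commit"})
    events.append({"type": "response.create"})
    return events
```

Each returned dict can be serialized with `json.dumps` and sent over the open connection in order, for example when the user releases the talk button.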

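## Tool calls

The StepFun Realtime API currently has two built-in tools: `web_search` and `retrieval`.

It also supports custom tool calls, so you can define your own tools and use them in a session.

### Configuring tools

Set the `tools` parameter in a `session.update` request to enable tools. In the example below, the first entry configures a built-in tool and the second a custom function: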

```json theme={null}
{
	"tools": [
		{
			"type": "web_search",
			"function": {
				"description": "网络搜索工具",
				"options": {
					"top_k": 5,
					"timeout_seconds": 3
				}
			}
		},
		{
			"type": "function",
			"function": {
				"name": "get_weather",
				"description": "获取指定城市的天气信息",
				"parameters": {
					"type": "object",
					"properties": {
						"location": {
							"type": "string",
							"description": "城市名称，如：北京、上海"
						}
					},
					"required": ["location"]
				}
			}
		}
	]
}
```

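### Web search tool: web\_search

With the `type: "web_search"` tool, the model can access the internet in real time to fetch up-to-date information. It suits scenarios that need broad knowledge and live data.

#### Usage example

Add the following structure to the `tools` parameter of a `session.update` request: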

```json theme={null}
{
	"tools": [{ "type": "web_search" }]
}
```

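This enables the web\_search tool for responses.

#### Advanced parameters

Both parameters are set under `function.options`:

* `options.top_k`: the number of results to return. Defaults to 5.
* `options.timeout_seconds`: the request timeout. Defaults to 3 seconds.

Example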

```json theme={null}
{
	"tools": [
		{
			"type": "web_search",
			"function": {
				"description": "搜索工具",
				"options": {
					"top_k": 5,
					"timeout_seconds": 3
				}
			}
		}
	]
}
```

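Once enabled, the model calls the tool automatically when a question requires a search. No extra logic is needed on your side.

### Knowledge base retrieval tool: retrieval

With the `type: "retrieval"` tool, the model can query a specified knowledge base (Vector Store) in real time to answer user questions precisely. It suits question answering in vertical domains, where you can define the reference data for replies in advance.

#### 1. Configuration

Add the following structure to the `tools` parameter of a `session.update` request: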

```json theme={null}
{
	"tools": [
		{
			"type": "retrieval",
			"function": {
				"description": "当用户问题涉及菜谱、烹饪步骤、食材处理或厨房技巧时，从此知识库检索答案",
				"options": {
					"vector_store_id": "133192900598194176",
					"prompt_template": "从文档{{knowledge}}中找到问题{{query}}的答案。根据文档内容中的语句提取答案，若文档中无答案则告知用户无法回答"
				}
			}
		}
	]
}
```

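#### 2. Parameters

| Parameter | Required | Description |
| --------- | -------- | ----------- |
| type | Yes | Fixed value "retrieval". Declares use of the knowledge base retrieval feature. |
| function.description | Yes | Key hint: state the trigger condition clearly in one or two sentences, for example "Use this knowledge base only when the question is about recipes, ingredients, or cooking." |
| options.vector\_store\_id | Yes | Unique ID of the knowledge base (example: 133192900598194176), which must be created on the platform in advance. |
| options.prompt\_template | No | Custom prompt template. Supports the placeholders `{{knowledge}}` (the retrieved passages) and `{{query}}` (the user's original question). If omitted, the default template is used: `从文档{{knowledge}}中找到问题{{query}}的答案。根据文档内容中的语句提取答案，若文档中无答案则告知用户无法回答` (find the answer to the question in the document, extract it from statements in the document, and tell the user it cannot be answered if the document has no answer). |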

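#### 3. How it works

1. Trigger: the user's question matches the domain described in `description` (for example, "How do I pan-fry a steak?").
2. Knowledge retrieval:
   * The system locates the knowledge base by `vector_store_id`.
   * The user's question (`{{query}}`) is embedded as a vector and the most relevant passages are retrieved.
3. Answer generation:
   * `{{knowledge}}` and `{{query}}` are filled into `prompt_template` automatically.
   * The model generates the final reply from the template (for example, *"According to the knowledge base: sear the steak for 2 minutes per side..."*).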
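
The placeholder substitution happens on the server, and its actual implementation is not documented here. As an illustration only, the following Python sketch shows what filling `{{knowledge}}` and `{{query}}` into a template amounts to, which can help when previewing a custom `prompt_template` locally. The function name `render_prompt` is hypothetical.

```python
import re


def render_prompt(template, knowledge, query):
    """Fill the {{knowledge}} and {{query}} placeholders in a single pass."""
    values = {"knowledge": knowledge, "query": query}
    # A single regex pass ensures that placeholder-like text inside the
    # retrieved passages is never substituted a second time.
    return re.sub(
        r"\{\{(knowledge|query)\}\}",
        lambda match: values[match.group(1)],
        template,
    )
```

For example, `render_prompt("Docs: {{knowledge}} Q: {{query}}", passages, question)` returns the template with both placeholders replaced and everything else left untouched.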

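#### 4. Best practices

1. Be precise in `description`, for example:
   `"description": "Answer only questions about recipe steps, ingredient substitutions, and cooking times; skip retrieval for everything else"`
2. Error handling: spell out in `prompt_template` how to respond when no answer is found, for example: `Here are the retrieved results: {{knowledge}}. If the knowledge base has no information about {{query}}, ask the user to try a more specific question.`
3. Avoiding conflicts: if the built-in web\_search tool is also enabled, the model may call web search instead of querying the knowledge base. You can steer it toward the knowledge base through `instructions` when initializing the session, for example:
   "You are a professional cooking assistant. When the user asks about recipes, ingredient handling, or cooking techniques, you must retrieve the answer from the knowledge base. If the knowledge base has no relevant information, tell the user so explicitly."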

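### Custom tool calls

Custom function calls let you extend the model so it can run specific business logic (such as looking up the weather or generating a horoscope). The basic flow is:

1. **Configure the function**: define the available functions at the session level.
2. **Detect the call**: the model decides whether to call a function based on the user's input.
3. **Run the function**: the client executes its own code with the generated arguments.
4. **Return the result**: send the function result back to the model and get the final response.

#### Function configuration

Add function definitions through the `session.tools` parameter of the `session.update` event: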

```json theme={null}
{
	"type": "session.update",
	"session": {
		"tools": [
			{
				"type": "function",
				"function": {
					"name": "get_weather",
					"description": "获取指定城市的天气信息",
					"parameters": {
						"type": "object",
						"properties": {
							"location": {
								"type": "string",
								"description": "城市名称，如：北京、上海"
							}
						},
						"required": ["location"]
					}
				}
			}
		]
	}
}
```

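#### Function call flow

##### 1. Detect the function call

When the user's input triggers a function call, you receive a conversation item describing the call: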

```json theme={null}
{
	"event_id": "cee2a492-012c-4655-ac50-5db2c99f4345",
	"type": "conversation.item.created",
	"item": {
		"id": "item_20250622105814",
		"type": "function_call",
		"status": "incomplete",
		"call_id": "call_20250622225814_get_weather",
		"name": "get_weather",
		"object": "realtime.item"
	}
}
```

##### 2. Receive the function arguments

While the arguments are being generated, you receive incremental update events:

```json theme={null}
{
	"type": "response.function_call_arguments.delta",
	"call_id": "call_20250622225814_get_weather",
	"arguments": "{\"location\":\"北",
	"name": "get_weather"
}
```

Once the arguments are complete, you receive a done event:

```json theme={null}
{
	"type": "response.function_call_arguments.done",
	"call_id": "call_20250622225814_get_weather",
	"arguments": "{\"location\":\"北京\"}",
	"name": "get_weather"
}
```

##### 3. Return the function result

After the client has run the function, it must send the result back to the model:

```json theme={null}
{
	"type": "conversation.item.create",
	"item": {
		"type": "function_call_output",
		"call_id": "call_20250622225814_get_weather",
		"output": "北京：晴，25°C"
	}
}
```

##### 4. Trigger the model response

After returning the function result, you must trigger the final model response manually:

```json theme={null}
{
	"type": "response.create"
}
```
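
Steps 2 to 4 can be tied together on the client with a small dispatcher. The Python sketch below is one possible arrangement under stated assumptions: `get_weather` is a hypothetical stand-in for real business logic, and `LOCAL_FUNCTIONS` and `handle_function_call` are illustrative names rather than part of the API.

```python
import json


def get_weather(location):
    """Hypothetical stand-in for a real weather lookup."""
    return f"{location}: sunny, 25°C"


# Maps function names declared in session.tools to local callables.
LOCAL_FUNCTIONS = {"get_weather": get_weather}


def handle_function_call(raw_message):
    """Return the client events to send in reply to a completed function call."""
    event = json.loads(raw_message)
    if event.get("type") != "response.function_call_arguments.done":
        return []
    func = LOCAL_FUNCTIONS.get(event["name"])
    if func is None:
        output = json.dumps({"error": "unknown function " + event["name"]})
    else:
        # The arguments arrive as a JSON-encoded string.
        output = func(**json.loads(event["arguments"]))
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": output,
            },
        },
        {"type": "response.create"},
    ]
```

The caller sends the returned events in order. If the same turn also contains speech, hold back the `response.create` event until playback has finished.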

#### Complete example: horoscope lookup

##### 1. Configure the horoscope function

```json theme={null}
{
	"type": "session.update",
	"session": {
		"tools": [
			{
				"type": "function",
				"function": {
					"name": "generate_horoscope",
					"description": "提供某个星座的今日运势。",
					"parameters": {
						"type": "object",
						"properties": {
							"sign": {
								"type": "string",
								"description": "需要查询运势的星座。",
								"enum": [
									"白羊座",
									"金牛座",
									"双子座",
									"巨蟹座",
									"狮子座",
									"处女座",
									"天秤座",
									"天蝎座",
									"射手座",
									"摩羯座",
									"水瓶座",
									"双鱼座"
								]
							}
						},
						"required": ["sign"]
					}
				}
			}
		]
	}
}
```

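##### 2. User request and function call

User input: "我的运势如何？我是水瓶座。" ("What's my horoscope? I'm an Aquarius.")

The model detects that a function call is needed and returns the call arguments: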

```json theme={null}
{
	"type": "response.done",
	"response": {
		"output": [
			{
				"type": "function_call",
				"name": "generate_horoscope",
				"call_id": "call_sHlR7iaFwQ2YQOqm",
				"arguments": "{\"sign\":\"水瓶座\"}"
			}
		]
	}
}
```

##### 3. Return the function result

```json theme={null}
{
	"type": "conversation.item.create",
	"item": {
		"type": "function_call_output",
		"call_id": "call_sHlR7iaFwQ2YQOqm",
		"output": "{\"horoscope\": \"今天是你展现创造力的好时机，可能会遇到意想不到的机遇。\"}"
	}
}
```

##### 4. Get the final response

```json theme={null}
{
	"type": "response.create"
}
```

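With the flow above, you can extend the model flexibly and implement all kinds of custom features.

> Note that a function call and speech can appear in the same turn. For example, with a get\_weather function the model may first say a filler phrase such as "Let me check the weather" and only then create the function call. The benefit is that for more complex function calls, the time spent playing the speech can be used to generate the function call in parallel. Make sure, however, to send `response.create` for the final reply only after the speech has finished playing.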

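## Development tips

### Implementing an opening greeting

In some applications you may want the model to greet the user or introduce itself at the start of the conversation. You can achieve this with a special prompt that asks the model to output a given sentence verbatim: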

```text theme={null}
请你原样无修改地输出下面的话：

你好，我是阶跃星辰开发的AI助手。
```

The client then creates a response proactively:

```json theme={null}
{
	"type": "response.create",
	"session": {
		"instructions": "请你原样无修改地输出下面的话：你好，我是阶跃星辰开发的AI助手。"
	}
}
```

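The model then opens the conversation with the greeting: 你好，我是阶跃星辰开发的AI助手。 ("Hello, I am an AI assistant developed by StepFun.")

### More flexible VAD

If you use server\_vad mode but each appended audio segment is long, response latency can be high, because VAD has to accumulate enough audio before it can make a decision. Submit audio at intervals of about 20 ms to keep detection responsive.

VAD also decides that the user has stopped speaking when it detects several consecutive frames of silence or low-volume audio. If the audio you submit does not end with enough silent frames, VAD may fail to identify the end of speech, and the automatic reply will not be triggered in time. To avoid this, keep some silence after each speech segment.

Recommended practice:

* Submit audio in small pieces: send audio data frequently in small chunks (for example 20 ms to 30 ms).
* If you feed audio from a file, make sure the file ends with a stretch of silence.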
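
Both recommendations can be combined in a small helper. The Python sketch below splits raw PCM16 audio into 20 ms frames and pads the end with silence. The 24000 Hz mono input format and the 500 ms of trailing silence are assumptions chosen for illustration, so adjust them to your session configuration. The function names are hypothetical.

```python
def frame_size_bytes(sample_rate=24000, frame_ms=20, bytes_per_sample=2):
    """Number of bytes in one mono PCM frame of the given duration."""
    return sample_rate * frame_ms // 1000 * bytes_per_sample


def split_with_trailing_silence(pcm, sample_rate=24000, frame_ms=20, silence_ms=500):
    """Split PCM16 bytes into short frames, padded with trailing silence."""
    frame_bytes = frame_size_bytes(sample_rate, frame_ms)
    # Zero-valued samples are digital silence in signed PCM16.
    silence = b"\x00" * frame_size_bytes(sample_rate, silence_ms)
    padded = pcm + silence
    return [padded[i:i + frame_bytes] for i in range(0, len(padded), frame_bytes)]
```

Each frame is then base64-encoded and sent in its own `input_audio_buffer.append` event. When replaying a file, pausing for roughly one frame duration between sends approximates a live microphone stream.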

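### Sample code for playing audio in a web page

To play streamed pcm16 audio, your playback logic must support repeated append operations. To allow prompt interruption, it should also be able to clear its buffer immediately.

Below is a simple example written in TypeScript for the browser.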

```typescript theme={null}
class SimplePCMPlayer {
	// Fixed parameters: mono, 24000 Hz sample rate (for this example)
	private static readonly SAMPLE_RATE = 24000
	private static readonly CHANNELS = 1

	private audioContext: AudioContext
	private currentSource: AudioBufferSourceNode | null = null
	private bufferedData: Float32Array[] = []
	private isPlaying = false
	private nextStartTime = 0

	constructor() {
		// Basic compatibility handling
		const AudioContextConstructor = window.AudioContext || (window as any).webkitAudioContext
		if (!AudioContextConstructor) {
			throw new Error('This browser does not support the Web Audio API')
		}
		this.audioContext = new AudioContextConstructor()
	}

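	/**
	 * Append 16-bit signed little-endian PCM data (mono, 24000 Hz)
	 * @param pcm raw PCM data (ArrayBuffer)
	 */
	appendPCM(pcm: ArrayBuffer) {
		// Basic validation
		if (pcm.byteLength % 2 !== 0) {
			console.error('PCM data length must be a multiple of 2 (16-bit integers)')
			return
		}

		// Convert 16-bit ints to 32-bit floats (normalized to [-1, 1))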
		const int16Array = new Int16Array(pcm)
		const float32Array = new Float32Array(int16Array.length)
		for (let i = 0; i < int16Array.length; i++) {
			float32Array[i] = int16Array[i] / 32768
		}

		this.bufferedData.push(float32Array)
		if (!this.isPlaying) {
			this.playNext()
		}
	}

	/**
	 * Play the next buffered segment
	 */
	private playNext() {
		if (this.bufferedData.length === 0) {
			this.isPlaying = false
			return
		}

		this.isPlaying = true
		const data = this.bufferedData.shift()!

		// Create the audio buffer
		const audioBuffer = this.audioContext.createBuffer(
			SimplePCMPlayer.CHANNELS,
			data.length,
			SimplePCMPlayer.SAMPLE_RATE,
		)
		audioBuffer.copyToChannel(data, 0)

		// Create the playback source
		const source = this.audioContext.createBufferSource()
		source.buffer = audioBuffer
		source.connect(this.audioContext.destination)

		// Compute the start time (avoids overlapping segments)
		const currentTime = this.audioContext.currentTime
		const startTime = Math.max(currentTime, this.nextStartTime)
		this.nextStartTime = startTime + audioBuffer.duration

		// Start playback
		source.start(startTime)
		this.currentSource = source

		// Continue with the next segment when this one ends
		source.onended = () => {
			this.currentSource = null
			this.playNext()
		}
	}

	/**
	 * Clear the buffer and stop playback
	 */
	clearAll() {
		// Stop the current playback
		if (this.currentSource) {
			this.currentSource.stop()
			this.currentSource.onended = null
			this.currentSource = null
		}

		// Reset state
		this.bufferedData = []
		this.isPlaying = false
		this.nextStartTime = this.audioContext.currentTime
	}
}
```

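## Error handling

When the server encounters an error during a session, it emits an `error` event. Sometimes these errors can be traced back to a client event sent by your application.

Unlike HTTP requests and responses, where a response is implicitly tied to the client's request, you need to use the `event_id` property on client events to determine whether one of them triggered an error on the server. The code below shows this approach: the client tries to send an unsupported event type.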

```javascript theme={null}
const event = {
	event_id: 'my_awesome_event',
	type: 'scooby.dooby.doo',
}

ws.send(JSON.stringify(event))
```

This failing client event triggers an error event like the following:

```json theme={null}
{
	"type": "invalid_request_error",
	"code": "invalid_value",
	"message": "无效值: 'scooby.dooby.doo' ...",
	"param": "type",
	"event_id": "my_awesome_event"
}
```