A set of resources that are used by the assistant’s tools. The resources are specific to the type of tool. For example, the code_interpreter tool requires a list of file IDs, while the file_search tool requires a list of vector store IDs.
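For illustration, a minimal sketch of this payload when creating an assistant with both tools (the IDs are placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Each tool reads its own resource block; the IDs here are placeholders.
    assistant = client.beta.assistants.create(
        model="gpt-4o",
        tools=[{"type": "code_interpreter"}, {"type": "file_search"}],
        tool_resources={
            "code_interpreter": {"file_ids": ["file-abc123"]},
            "file_search": {"vector_store_ids": ["vs_abc123"]},
        },
    )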
Developer-provided instructions that the model should follow, regardless of
messages sent by the user. With o1 models and newer, developer messages
replace the previous system messages.
Developer-provided instructions that the model should follow, regardless of
messages sent by the user. With o1 models and newer, use developer messages
for this purpose instead.
The container will expire after this time period.
The anchor is the reference point for the expiration.
The minutes value is the number of minutes after the anchor at which the container expires.
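A minimal sketch, assuming the Containers API request shape in the Python SDK (values are placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Expire the container 20 minutes after it was last active.
    container = client.containers.create(
        name="my-container",
        expires_after={"anchor": "last_active_at", "minutes": 20},
    )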
A set of resources that are used by the assistant’s tools. The resources are specific to the type of tool. For example, the code_interpreter tool requires a list of file IDs, while the file_search tool requires a list of vector store IDs.
Represents a completion response from the API. Note: both the streamed and non-streamed response objects share the same shape (unlike the chat endpoint).
A CustomDataSourceConfig object that defines the schema for the data source used for the evaluation runs.
This schema is used to define the shape of the data that will be used to define your testing criteria and what data is required when creating a run.
A data source config which specifies the metadata property of your logs query.
This is usually metadata like usecase=chatbot or prompt-version=v2, etc.
The settings for your integration with Weights and Biases. This payload specifies the project that
metrics will be sent to. Optionally, you can set an explicit display name for your run, add tags
to your run, and set a default entity (team, username, etc) to be associated with your run.
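A sketch of this payload on a fine-tuning job request (project, IDs, and names are placeholders):

    from openai import OpenAI

    client = OpenAI()

    job = client.fine_tuning.jobs.create(
        model="gpt-4o-mini-2024-07-18",
        training_file="file-abc123",
        integrations=[{
            "type": "wandb",
            "wandb": {
                "project": "my-project",  # where metrics are sent (required)
                "name": "my-run",         # optional explicit display name
                "tags": ["baseline"],     # optional run tags
                "entity": "my-team",      # optional default entity
            },
        }],
    )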
A set of resources that are used by the assistant’s tools. The resources are specific to the type of tool. For example, the code_interpreter tool requires a list of file IDs, while the file_search tool requires a list of vector store IDs.
A set of resources that are used by the assistant’s tools. The resources are specific to the type of tool. For example, the code_interpreter tool requires a list of file IDs, while the file_search tool requires a list of vector store IDs.
A set of resources that are made available to the assistant’s tools in this thread. The resources are specific to the type of tool. For example, the code_interpreter tool requires a list of file IDs, while the file_search tool requires a list of vector store IDs.
A message input to the model with a role indicating instruction following
hierarchy. Instructions given with the developer or system role take
precedence over instructions given with the user role. Messages with the
assistant role are presumed to have been generated by the model in previous
interactions.
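A sketch of the hierarchy in practice (the message text is illustrative):

    messages = [
        # Developer instructions take precedence over the user request below.
        {"role": "developer", "content": "Answer only in English."},
        {"role": "user", "content": "Please answer in French."},
        # A prior model turn, replayed as conversation history.
        {"role": "assistant", "content": "Here is my answer, in English."},
    ]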
A CustomDataSourceConfig which specifies the schema of your item and optionally sample namespaces.
The response schema defines the shape of the data that will be used to define your testing criteria and what data is required when creating a run.
A message input to the model with a role indicating instruction following
hierarchy. Instructions given with the developer or system role take
precedence over instructions given with the user role. Messages with the
assistant role are presumed to have been generated by the model in previous
interactions.
A LogsDataSourceConfig which specifies the metadata property of your logs query.
This is usually metadata like usecase=chatbot or prompt-version=v2, etc.
The schema returned by this data source config is used to define which variables are available in your evals.
item and sample are both defined when using this data source config.
Per-line training example for reinforcement fine-tuning. Note that messages and tools are the only reserved keywords. Any other arbitrary key-value data can be included on training datapoints and will be available to reference during grading under the {{ item.XXX }} template variable.
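A sketch of one such line written as JSONL; reference_answer is a hypothetical extra key that a grader could then read as {{ item.reference_answer }}:

    import json

    datapoint = {
        "messages": [{"role": "user", "content": "What is 2 + 2?"}],
        # Hypothetical key, available to graders as {{ item.reference_answer }}.
        "reference_answer": "4",
    }
    with open("train.jsonl", "a") as f:
        f.write(json.dumps(datapoint) + "\n")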
The settings for your integration with Weights and Biases. This payload specifies the project that
metrics will be sent to. Optionally, you can set an explicit display name for your run, add tags
to your run, and set a default entity (team, username, etc) to be associated with your run.
A message input to the model with a role indicating instruction following
hierarchy. Instructions given with the developer or system role take
precedence over instructions given with the user role.
A citation within the message that points to a specific quote from a specific File associated with the assistant or the message. Generated when the assistant uses the “file_search” tool to search files.
A citation within the message that points to a specific quote from a specific File associated with the assistant or the message. Generated when the assistant uses the “file_search” tool to search files.
A set of resources that are used by the assistant’s tools. The resources are specific to the type of tool. For example, the code_interpreter tool requires a list of file IDs, while the file_search tool requires a list of vector store IDs.
A set of resources that are made available to the assistant’s tools in this thread. The resources are specific to the type of tool. For example, the code_interpreter tool requires a list of file IDs, while the file_search tool requires a list of vector store IDs.
This is returned when the chunking strategy is unknown. Typically, this is because the file was indexed before the chunking_strategy concept was introduced in the API.
Add a new Item to the Conversation’s context, including messages, function
calls, and function call responses. This event can be used both to populate a
“history” of the conversation and to add new items mid-stream, but has the
current limitation that it cannot populate assistant audio messages.
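A sketch of the event, assuming ws is an already-open WebSocket connection to the Realtime API (the text is a placeholder):

    import json

    ws.send(json.dumps({
        "type": "conversation.item.create",
        "previous_item_id": None,  # append at the end of the conversation
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Hello!"}],
        },
    }))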
Send this event when you want to remove any item from the conversation
history. The server will respond with a conversation.item.deleted event,
unless the item does not exist in the conversation history, in which case the
server will respond with an error.
Send this event when you want to retrieve the server’s representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD.
The server will respond with a conversation.item.retrieved event,
unless the item does not exist in the conversation history, in which case the
server will respond with an error.
Send this event to truncate a previous assistant message’s audio. The server
will produce audio faster than realtime, so this event is useful when the user
interrupts to truncate audio that has already been sent to the client but not
yet played. This will synchronize the server’s understanding of the audio with
the client’s playback.
Send this event to append audio bytes to the input audio buffer. The audio
buffer is temporary storage you can write to and later commit. In Server VAD
mode, the audio buffer is used to detect speech and the server will decide
when to commit. When Server VAD is disabled, you must commit the audio buffer
manually.
Send this event to commit the user input audio buffer, which will create a
new user message item in the conversation. This event will produce an error
if the input audio buffer is empty. When in Server VAD mode, the client does
not need to send this event; the server will commit the audio buffer
automatically.
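A sketch of the manual flow when Server VAD is off, assuming ws is an open Realtime connection and audio_chunk holds raw audio bytes:

    import base64
    import json

    ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
    }))
    # ...append further chunks as they arrive, then commit:
    ws.send(json.dumps({"type": "input_audio_buffer.commit"}))  # creates a user message item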
WebRTC Only: Emit to cut off the current audio response. This will trigger the server to
stop generating audio and emit a output_audio_buffer.cleared event. This
event should be preceded by a response.cancel client event to stop the
generation of the current response.
Send this event to cancel an in-progress response. The server will respond
with a response.cancelled event or an error if there is no response to
cancel.
This event instructs the server to create a Response, which means triggering
model inference. When in Server VAD mode, the server will create Responses
automatically.
Send this event to update the session’s default configuration.
The client may send this event at any time to update any field,
except for voice. However, note that once a session has been
initialized with a particular model, it can’t be changed to
another model using session.update.
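A sketch of an update, assuming ws is an open Realtime connection (the fields shown are a subset of the session configuration):

    import json

    ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": "You are a concise assistant.",
            "turn_detection": {"type": "server_vad"},
        },
    }))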
Usage statistics for the Response; this will correspond to billing. A
Realtime API session will maintain a conversation context and append new
Items to the Conversation, thus output from previous turns (text and
audio tokens) will become the input for later turns.
Returned when an item in the conversation is deleted by the client with a
conversation.item.delete event. This event is used to synchronize the
server’s understanding of the conversation history with the client’s view.
This event is the output of audio transcription for user audio written to the
user audio buffer. Transcription begins when the input audio buffer is
committed by the client or server (in server_vad mode). Transcription runs
asynchronously with Response creation, so this event may come before or after
the Response events.
Returned when input audio transcription is configured, and a transcription
request for a user message failed. These events are separate from other
error events so that the client can identify the related Item.
Returned when an earlier assistant audio message item is truncated by the
client with a conversation.item.truncate event. This event is used to
synchronize the server’s understanding of the audio with the client’s playback.
Returned when an error occurs, which could be a client problem or a server
problem. Most errors are recoverable and the session will stay open; we
recommend that implementers monitor and log error messages by default.
Returned when an input audio buffer is committed, either by the client or
automatically in server VAD mode. The item_id property is the ID of the user
message item that will be created, thus a conversation.item.created event
will also be sent to the client.
Sent by the server when in server_vad mode to indicate that speech has been
detected in the audio buffer. This can happen any time audio is added to the
buffer (unless speech is already detected). The client may want to use this
event to interrupt audio playback or provide visual feedback to the user.
Returned in server_vad mode when the server detects the end of speech in
the audio buffer. The server will also send a conversation.item.created
event with the user message item that is created from the audio buffer.
WebRTC Only: Emitted when the output audio buffer is cleared. This happens either in VAD
mode when the user has interrupted (input_audio_buffer.speech_started),
or when the client has emitted the output_audio_buffer.clear event to manually
cut off the current audio response.
WebRTC Only: Emitted when the server begins streaming audio to the client. This event is
emitted after an audio content part has been added (response.content_part.added)
to the response.
WebRTC Only: Emitted when the output audio buffer has been completely drained on the server,
and no more audio is forthcoming. This event is emitted after the full response
data has been sent to the client (response.done).
Emitted at the beginning of a Response to indicate the updated rate limits.
When a Response is created, some tokens will be “reserved” for the output
tokens, the rate limits shown here reflect that reservation, which is then
adjusted accordingly once the Response is completed.
Returned when the model-generated transcription of audio output is done
streaming. Also emitted when a Response is interrupted, incomplete, or
cancelled.
Returned when a Response is done streaming. Always emitted, no matter the
final state. The Response object included in the response.done event will
include all output Items in the Response but will omit the raw audio data.
Returned when a Session is created. Emitted automatically when a new
connection is established as the first server event. This event will contain
the default Session configuration.
Configuration for input audio noise reduction. This can be set to null to turn off.
Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model.
Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
Configuration for input audio transcription, which defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with “uhhm”, the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
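Sketches of both configurations (the threshold and timing values are illustrative):

    server_vad = {
        "type": "server_vad",
        "threshold": 0.5,            # activation threshold for speech
        "prefix_padding_ms": 300,    # audio retained before detected speech
        "silence_duration_ms": 500,  # silence required to end the turn
    }

    semantic_vad = {
        "type": "semantic_vad",
        "eagerness": "auto",         # low | medium | high | auto
    }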
Configuration for input audio transcription, defaults to off and can be
set to null to turn off once on. Input audio transcription is not native
to the model, since the model consumes audio directly. Transcription runs
asynchronously through Whisper and should be treated as rough guidance
rather than the representation understood by the model.
Configuration for turn detection. Can be set to null to turn off. Server
VAD means that the model will detect the start and end of speech based on
audio volume and respond at the end of user speech.
Configuration for input audio noise reduction. This can be set to null to turn off.
Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model.
Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
Configuration for input audio transcription, which defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with “uhhm”, the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
Configuration for input audio noise reduction. This can be set to null to turn off.
Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model.
Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
Configuration for input audio transcription. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with “uhhm”, the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
Configuration for turn detection. Can be set to null to turn off. Server
VAD means that the model will detect the start and end of speech based on
audio volume and respond at the end of user speech.
A description of the chain of thought used by a reasoning model while generating
a response. Be sure to include these items in your input to the Responses API
for subsequent turns of a conversation if you are manually
managing context.
JSON object response format. An older method of generating JSON responses.
Using json_schema is recommended for models that support it. Note that the
model will not generate JSON without a system or user message instructing it
to do so.
A set of resources that are made available to the assistant’s tools in this thread. The resources are specific to the type of tool. For example, the code_interpreter tool requires a list of file IDs, while the file_search tool requires a list of vector store IDs.
Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the Stream parameter set to true.
Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the Stream parameter set to true.
Controls which (if any) tool is called by the model.
none means the model will not call any tools and instead generates a message.
auto is the default value and means the model can pick between generating a message or calling one or more tools.
required means the model must call one or more tools before responding to the user.
Specifying a particular tool like {"type": "file_search"} or {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool.
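Sketched as request values, the options above look like:

    tool_choice = "none"        # generate a message, never call tools
    tool_choice = "auto"        # default: model chooses
    tool_choice = "required"    # model must call one or more tools first
    tool_choice = {"type": "file_search"}  # force a built-in tool
    tool_choice = {"type": "function", "function": {"name": "my_function"}}  # force one function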
The format of the output, in one of these options: json, text, srt, verbose_json, or vtt. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json.
Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message.
auto means the model can pick between generating a message or calling one or more tools.
required means the model must call one or more tools.
Specifying a particular tool via {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool.
Configuration for a Predicted Output,
which can greatly improve response times when large parts of the model
response are known ahead of time. This is most common when you are
regenerating a file with only minor changes to most of the content.
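A sketch of supplying a prediction when regenerating a file (the file contents and prompt are placeholders):

    from openai import OpenAI

    client = OpenAI()

    existing_code = "x = 1\nprint(x)\n"  # prior file contents (placeholder)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rename the variable x to count in this file:\n" + existing_code,
        }],
        # Output tokens that match the prediction can be returned much faster.
        prediction={"type": "content", "content": existing_code},
    )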
The reason the model stopped generating tokens. This will be stop if the model hit a natural stop point or a provided stop sequence,
length if the maximum number of tokens specified in the request was reached,
content_filter if content was omitted due to a flag from our content filters,
tool_calls if the model called a tool, or function_call (deprecated) if the model called a function.
The reason the model stopped generating tokens. This will be stop if the model hit a natural stop point or a provided stop sequence,
length if the maximum number of tokens specified in the request was reached,
content_filter if content was omitted due to a flag from our content filters,
tool_calls if the model called a tool, or function_call (deprecated) if the model called a function.
The reason the model stopped generating tokens. This will be stop if the model hit a natural stop point or a provided stop sequence,
length if the maximum number of tokens specified in the request was reached,
or content_filter if content was omitted due to a flag from our content filters.
Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for all embedding models), cannot be an empty string, and any array must be 2048 dimensions or less (a Python sketch for counting tokens follows below). In addition to the per-input token limit, all embedding models enforce a maximum of 300,000 tokens summed across all inputs in a single request.
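A minimal token-counting sketch with tiktoken (the encoding name is an assumption for the text-embedding-3 models):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    inputs = ["first text to embed", "second text to embed"]
    counts = [len(enc.encode(text)) for text in inputs]
    assert max(counts) <= 8192      # per-input token limit
    assert sum(counts) <= 300_000   # per-request limit across all inputs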
Used when sampling from a model. Dictates the structure of the messages passed into the model. Can either be a reference to a prebuilt trajectory (i.e., item.input_trajectory), or a template with variable references to the item namespace.
Used when sampling from a model. Dictates the structure of the messages passed into the model. Can either be a reference to a prebuilt trajectory (i.e., item.input_trajectory), or a template with variable references to the item namespace.
Allows setting transparency for the background of the generated image(s).
This parameter is only supported for gpt-image-1. Must be one of
transparent, opaque or auto (default value). When auto is used, the
model will automatically determine the best background for the image.
The quality of the image that will be generated. high, medium and low are only supported for gpt-image-1. dall-e-2 only supports standard quality. Defaults to auto.
The format in which the generated images are returned. Must be one of url or b64_json. URLs are only valid for 60 minutes after the image has been generated. This parameter is only supported for dall-e-2, as gpt-image-1 will always return base64-encoded images.
The size of the generated images. Must be one of 1024x1024, 1536x1024 (landscape), 1024x1536 (portrait), or auto (default value) for gpt-image-1, and one of 256x256, 512x512, or 1024x1024 for dall-e-2.
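A sketch combining the background, quality, and size parameters above for gpt-image-1 (the prompt is illustrative):

    from openai import OpenAI

    client = OpenAI()

    image = client.images.generate(
        model="gpt-image-1",
        prompt="a studio photo of a red bicycle",
        size="1024x1536",          # portrait
        quality="low",
        background="transparent",  # returned as base64-encoded image data
    )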
Allows setting transparency for the background of the generated image(s).
This parameter is only supported for gpt-image-1. Must be one of
transparent, opaque or auto (default value). When auto is used, the
model will automatically determine the best background for the image.
The format in which generated images with dall-e-2 and dall-e-3 are returned. Must be one of url or b64_json. URLs are only valid for 60 minutes after the image has been generated. This parameter isn’t supported for gpt-image-1, which will always return base64-encoded images.
The size of the generated images. Must be one of 1024x1024, 1536x1024 (landscape), 1024x1536 (portrait), or auto (default value) for gpt-image-1, one of 256x256, 512x512, or 1024x1024 for dall-e-2, and one of 1024x1024, 1792x1024, or 1024x1792 for dall-e-3.
The style of the generated images. This parameter is only supported for dall-e-3. Must be one of vivid or natural. Vivid causes the model to lean towards generating hyper-real and dramatic images. Natural causes the model to produce more natural, less hyper-real looking images.
The format in which the generated images are returned. Must be one of url or b64_json. URLs are only valid for 60 minutes after the image has been generated.
The intended purpose of the uploaded file. One of:
- assistants: Used in the Assistants API
- batch: Used in the Batch API
- fine-tune: Used for fine-tuning
- vision: Images used for vision fine-tuning
- user_data: Flexible file type for any purpose
- evals: Used for eval data sets
The content that should be matched when generating a model response.
If generated tokens would match this content, the entire model response
can be returned much more quickly.
The status of the item (completed, incomplete). These have no effect
on the conversation, but are accepted for consistency with the
conversation.item.created event.
The status of the item (completed, incomplete). These have no effect
on the conversation, but are accepted for consistency with the
conversation.item.created event.
Controls which conversation the response is added to. Currently supports
auto and none, with auto as the default value. The auto value
means that the contents of the response will be added to the default
conversation. Set this to none to create an out-of-band response which
will not add items to the default conversation.
Maximum number of output tokens for a single assistant response,
inclusive of tool calls. Provide an integer between 1 and 4096 to
limit output tokens, or inf for the maximum available tokens for a
given model. Defaults to inf.
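A sketch of an out-of-band response request, assuming ws is an open Realtime connection (the instructions are illustrative):

    import json

    ws.send(json.dumps({
        "type": "response.create",
        "response": {
            "conversation": "none",    # do not add items to the default conversation
            "max_output_tokens": 512,  # or "inf" for the model maximum
            "instructions": "Classify the sentiment of the conversation so far.",
        },
    }))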
The reason the Response did not complete. For a cancelled Response,
one of turn_detected (the server VAD detected a new start of speech)
or client_cancelled (the client sent a cancel event). For an
incomplete Response, one of max_output_tokens or content_filter
(the server-side safety filter activated and cut off the response).
The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.
For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate,
single channel (mono), and little-endian byte order.
Type of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.
Maximum number of output tokens for a single assistant response,
inclusive of tool calls. Provide an integer between 1 and 4096 to
limit output tokens, or inf for the maximum available tokens for a
given model. Defaults to inf.
Used only for semantic_vad mode. The eagerness of the model to respond: low will wait longer for the user to continue speaking, and high will respond more quickly. auto is the default and is equivalent to medium.
Maximum number of output tokens for a single assistant response,
inclusive of tool calls. Provide an integer between 1 and 4096 to
limit output tokens, or inf for the maximum available tokens for a
given model. Defaults to inf.
The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.
For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate,
single channel (mono), and little-endian byte order.
Type of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.
Maximum number of output tokens for a single assistant response,
inclusive of tool calls. Provide an integer between 1 and 4096 to
limit output tokens, or inf for the maximum available tokens for a
given model. Defaults to inf.
Used only for semantic_vad mode. The eagerness of the model to respond: low will wait longer for the user to continue speaking, and high will respond more quickly. auto is the default and is equivalent to medium.
The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.
For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate,
single channel (mono), and little-endian byte order.
Type of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.
Used only for semantic_vad mode. The eagerness of the model to respond: low will wait longer for the user to continue speaking, and high will respond more quickly. auto is the default and is equivalent to medium.
A summary of the reasoning performed by the model. This can be
useful for debugging and understanding the model’s reasoning process.
One of auto, concise, or detailed.
How the model should select which tool (or tools) to use when generating
a response. See the tools parameter to see how to specify which tools
the model can call.
The truncation strategy to use for the thread. The default is auto. If set to last_messages, the thread will be truncated to the n most recent messages in the thread. When set to auto, messages in the middle of the thread will be dropped to fit the context length of the model, max_prompt_tokens.
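A sketch on a run request (the IDs are placeholders):

    from openai import OpenAI

    client = OpenAI()

    run = client.beta.threads.runs.create(
        thread_id="thread_abc123",
        assistant_id="asst_abc123",
        # Keep only the 10 most recent messages in the model context.
        truncation_strategy={"type": "last_messages", "last_messages": 10},
    )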
The status of the vector store file, which can be either in_progress, completed, cancelled, or failed. The status completed indicates that the vector store file is ready for use.
The status of the vector store, which can be either expired, in_progress, or completed. A status of completed indicates that the vector store is ready for use.
The parameters the function accepts, described as a JSON Schema object. See the guide for examples, and the JSON Schema reference for documentation about the format.
Set of 16 key-value pairs that can be attached to an object. This can be
useful for storing additional information about the object in a structured
format, and querying for objects via API or the dashboard.
Controls how the audio is cut into chunks. When set to "auto", the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A server_vad object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block.
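A sketch of the manual server_vad variant (the parameter values are illustrative assumptions, not recommended settings):

    from openai import OpenAI

    client = OpenAI()

    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=open("meeting.wav", "rb"),
        chunking_strategy={
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200,
        },
    )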
Set of 16 key-value pairs that can be attached to an object. This can be
useful for storing additional information about the object in a structured
format, and querying for objects via API or the dashboard. Keys are strings
with a maximum length of 64 characters. Values are strings with a maximum
length of 512 characters, booleans, or numbers.
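A sketch of a metadata object within these limits (keys and values are placeholders):

    metadata = {
        "customer_id": "cus_123",  # string value, at most 512 characters
        "beta_user": True,         # boolean values are accepted
        "retries": 2,              # numeric values are accepted
    }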