- model: o1-mini
model_provider: openai
inference_provider:
provider: openai
model_name: o1-mini
endpoint: null
price:
per_input_token: 3.0
per_output_token: 12.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: The o1 series of large language models are trained with reinforcement learning to perform complex reasoning. o1 models think before they answer, producing a long internal chain of thought before responding to the user. o1-mini is a faster and cheaper reasoning model that is particularly good at coding, math, and science.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
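# Illustrative note, not part of the schema. Assuming the price fields are USD per
# one million tokens (this file does not state the unit), a request's cost could be
# estimated as:
#   cost = input_tokens / 1e6 * per_input_token + output_tokens / 1e6 * per_output_token
# For o1-mini with 2,000 input tokens and 500 output tokens:
#   2000 / 1e6 * 3.0 + 500 / 1e6 * 12.0 = 0.006 + 0.006 = 0.012 USD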
- model: o1-preview
model_provider: openai
inference_provider:
provider: openai
model_name: o1-preview
endpoint: null
price:
per_input_token: 15.0
per_output_token: 60.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: The o1 series of large language models are trained with reinforcement learning to perform complex reasoning. o1 models think before they answer, producing a long internal chain of thought before responding to the user. o1-preview is a reasoning model designed to solve hard problems across domains.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
- model: gpt-4o
model_provider: openai
inference_provider:
provider: openai
model_name: gpt-4o
endpoint: null
price:
per_input_token: 2.5
per_output_token: 10.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: High-intelligence flagship model for complex, multi-step tasks. GPT-4o is cheaper and faster than GPT-4 Turbo. It is multimodal (accepting text or image inputs and outputting text), and it has the same high intelligence as GPT-4 Turbo but is much more efficient—it generates text 2x faster and is 50% cheaper. Additionally, GPT-4o has the best vision and performance across non-English languages of any of our models. GPT-4o is available in the OpenAI API to paying customers.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
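# Illustrative note, not part of the schema. A hypothetical set of request parameters
# that stays within the ranges declared above for gpt-4o (how a consumer of this file
# passes them to the provider is an assumption):
#   temperature: 0.2      # within [0.0, 2.0]
#   top_p: 1              # left at its default because temperature is being tuned
#   max_tokens: 1000      # prompt tokens plus max_tokens must fit in max_context_size
#   logprobs: true
#   top_logprobs: 5       # only meaningful when logprobs is true; 0 to 20
#   stop: ["\n\n"]        # at most 4 stop sequences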
- model: gpt-4o-mini
model_provider: openai
inference_provider:
provider: openai
model_name: gpt-4o-mini
endpoint: null
price:
per_input_token: 0.15
per_output_token: 0.6
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: GPT-4o mini (o for omni) is a fast, affordable small model for focused tasks. It accepts both text and image inputs, and produces text outputs (including Structured Outputs). It is ideal for fine-tuning, and model outputs from a larger model like GPT-4o can be distilled to GPT-4o-mini to produce similar results at lower cost and latency. The knowledge cutoff for GPT-4o-mini models is October 2023.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3-5-sonnet-20240620
model_provider: anthropic
inference_provider:
provider: anthropic
model_name: claude-3-5-sonnet-20240620
endpoint: null
price:
per_input_token: 3.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: The most intelligent Claude model, with the highest level of intelligence and capability.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-1.5-pro-latest
model_provider: google
inference_provider:
provider: gemini
model_name: gemini-1.5-pro-latest
endpoint: null
price:
per_input_token: 2.5
per_output_token: 10.0
valid_from: null
input_formats:
- text
- image
- audio
- video
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 2000000
description: Mid-size multimodal model optimized for a wide range of reasoning tasks, with a long context window for processing large amounts of data at once.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-2.0-flash-exp
model_provider: google
inference_provider:
provider: gemini
model_name: gemini-2.0-flash-exp
endpoint: null
price:
per_input_token: 2.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
- image
- audio
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1048576
description: Next-generation features, speed, and multimodal generation for a diverse variety of tasks.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama3-2-3b-instruct-v1.0
model_provider: meta
inference_provider:
provider: bedrock
model_name: llama3-2-3b-instruct-v1.0
endpoint: null
price:
per_input_token: 0.15
per_output_token: 0.15
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: Text-only lightweight model built to deliver highly accurate and relevant results. Designed for applications requiring low-latency inferencing and limited computational resources. Ideal for query and prompt rewriting, mobile AI-powered writing assistants, and customer service chatbots, particularly on edge devices where its efficiency and low latency matter most.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-7b-instruct-v0.2
model_provider: mistralai
inference_provider:
provider: bedrock
model_name: mistral-7b-instruct-v0.2
endpoint: null
price:
per_input_token: 0.15
per_output_token: 0.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32000
description: A 7B dense Transformer, fast-deployed and easily customizable. Small, yet powerful for a variety of use cases.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. A value of 1.0 applies no penalty; values above 1.0 (up to 2.0) make the model less likely to repeat tokens, whereas values below 1.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: grok-2
model_provider: xai
inference_provider:
provider: xai
model_name: grok-2
endpoint: https://api.x.ai/v1
price:
per_input_token: 2.0
per_output_token: 10.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: Grok-2 is an advanced AI model developed by xAI, designed to provide highly accurate and helpful responses to a wide range of questions, often with a unique perspective on humanity.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: grok-2-vision-1212
model_provider: xai
inference_provider:
provider: xai
model_name: grok-2-vision-1212
endpoint: https://api.x.ai/v1
price:
per_input_token: 2.0
per_output_token: 10.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: Grok-2-vision-1212 is an advanced AI model developed by xAI that integrates multimodal capabilities, allowing it to process and understand both text and visual inputs to provide more comprehensive and context-aware responses
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-reasoner
model_provider: deepseek
inference_provider:
provider: deepseek
model_name: deepseek-reasoner
endpoint: https://api.deepseek.com/v1
price:
per_input_token: 0.55
per_output_token: 2.19
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 64000
description: DeepSeek-Reasoner is an advanced AI model designed to enhance logical reasoning and problem-solving capabilities, leveraging deep learning techniques to provide accurate and contextually relevant insights across various domains.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-chat
model_provider: deepseek
inference_provider:
provider: deepseek
model_name: deepseek-chat
endpoint: https://api.deepseek.com/v1
price:
per_input_token: 0.14
per_output_token: 0.28
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 64000
description: DeepSeek-Chat is an advanced conversational AI model designed to provide intelligent
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
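# Worked cost example for the token prices above. This is a sketch, not part
# of the registry data. ASSUMPTION: per_input_token and per_output_token are
# USD per one million tokens; this file does not state the unit.
#
# Hypothetical request against deepseek-chat:
#   input_tokens: 2000
#   output_tokens: 500
#
# Under that assumption:
#   input_cost:  2000 * 0.14 / 1000000 = 0.00028
#   output_cost:  500 * 0.28 / 1000000 = 0.00014
#   total_cost:  0.00028 + 0.00014     = 0.00042   # USD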
- model: dall-e-2
model_provider: openai
inference_provider:
provider: openai
model_name: dall-e-2
endpoint: null
price:
type_prices:
standard:
512x512: 0.018
1024x1024: 0.02
256x256: 0.016
mp_price: 1.23
valid_from: null
input_formats:
- text
output_formats:
- image
capabilities: []
type: image_generation
limits:
max_context_size: 0
description: DALL·E 2 is an advanced AI model by OpenAI that generates high-quality images from text descriptions, allowing for creative visualizations and edits of images based on user prompts.
parameters: null
- model: dall-e-3
model_provider: openai
inference_provider:
provider: openai
model_name: dall-e-3
endpoint: null
price:
type_prices:
hd:
1792x1024: 0.12
1024x1024: 0.08
1024x1792: 0.12
standard:
1024x1024: 0.04
1024x1792: 0.08
1792x1024: 0.08
mp_price: null
valid_from: null
input_formats:
- text
output_formats:
- image
capabilities: []
type: image_generation
limits:
max_context_size: 0
description: DALL·E 3 is the latest iteration of OpenAI's image generation model, offering even more accurate, detailed, and creative image creation from text prompts, with improved coherence and understanding of complex requests.
parameters: null
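# Worked cost example for the image price tables above. This is a sketch, not
# part of the registry data. ASSUMPTION: type_prices is keyed by quality and
# then by size, and each value is USD per generated image; this file does not
# state the unit.
#
# Hypothetical dall-e-3 requests:
#   - quality: hd
#     size: 1024x1792
#     images: 2          # 2 * 0.12 = 0.24 USD
#   - quality: standard
#     size: 1024x1024
#     images: 3          # 3 * 0.04 = 0.12 USD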
- model: gpt-3.5-turbo
model_provider: openai
inference_provider:
provider: openai
model_name: gpt-3.5-turbo-0125
endpoint: null
price:
per_input_token: 0.5
per_output_token: 1.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 16385
description: The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
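# Illustrative parameter set for the gpt-3.5-turbo entry above. This is a
# sketch: the request shape is hypothetical, and only the parameter names and
# bounds come from this file.
#
#   model: gpt-3.5-turbo
#   parameters:
#     max_tokens: 256      # prompt tokens + max_tokens must fit in max_context_size (16385)
#     temperature: 0.2     # inside min 0.0 / max 2.0, a multiple of step 0.1
#     top_p: 1             # left at its default because temperature is changed
#     stop: ["\n\n"]       # at most 4 stop sequences
#     seed: 42             # best-effort determinism only
#     logprobs: true
#     top_logprobs: 5      # 0 to 20, requires logprobs to be enabled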
- model: text-embedding-3-large
model_provider: openai
inference_provider:
provider: openai
model_name: text-embedding-3-large
endpoint: null
price:
per_input_token: 0.13
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: embeddings
limits:
max_context_size: 8191
description: Most capable embedding model for both English and non-English tasks
parameters: null
- model: text-embedding-3-small
model_provider: openai
inference_provider:
provider: openai
model_name: text-embedding-3-small
endpoint: null
price:
per_input_token: 0.02
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: embeddings
limits:
max_context_size: 8191
description: Increased performance over 2nd generation ada embedding model
parameters: null
- model: text-embedding-ada-002
model_provider: openai
inference_provider:
provider: openai
model_name: text-embedding-ada-002
endpoint: null
price:
per_input_token: 0.1
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: embeddings
limits:
max_context_size: 8192
description: Most capable 2nd generation embedding model, replacing 16 first generation models
parameters: null
- model: claude-3-haiku-20240307
model_provider: anthropic
inference_provider:
provider: anthropic
model_name: claude-3-haiku-20240307
endpoint: null
price:
per_input_token: 0.25
per_output_token: 1.25
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: Fastest and most compact model for near-instant responsiveness. Quick and accurate targeted performance
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3-opus-20240229
model_provider: anthropic
inference_provider:
provider: anthropic
model_name: claude-3-opus-20240229
endpoint: null
price:
per_input_token: 15.0
per_output_token: 75.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: Powerful model for highly complex tasks. Top-level intelligence, fluency, and understanding
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3-sonnet-20240229
model_provider: anthropic
inference_provider:
provider: anthropic
model_name: claude-3-sonnet-20240229
endpoint: null
price:
per_input_token: 3.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: Balance of intelligence and speed. Strong utility, balanced for scaled deployments
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-1.5-flash-8b
model_provider: gemini
inference_provider:
provider: gemini
model_name: gemini-1.5-flash-8b
endpoint: null
price:
per_input_token: 0.075
per_output_token: 0.3
valid_from: null
input_formats:
- text
- image
- audio
- video
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1000000
description: Lightweight model that is smaller and faster than 1.5 Flash, with a lower price, higher rate limits, and lower latency on small prompts.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-1.5-flash-latest
model_provider: gemini
inference_provider:
provider: gemini
model_name: gemini-1.5-flash-latest
endpoint: null
price:
per_input_token: 0.15
per_output_token: 0.6
valid_from: null
input_formats:
- text
- image
- audio
- video
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1000000
description: Fast and versatile performance across a diverse variety of tasks.
parameters: {}
- model: command-r-plus-v1.0
model_provider: cohere
inference_provider:
provider: bedrock
model_name: command-r-plus-v1.0
endpoint: null
price:
per_input_token: 3.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: Command R+ is an instruction-following conversational model that performs language tasks at a higher quality, more reliably, and with a longer context than previous models. It is best suited for complex RAG workflows and multi-step tool use.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: command-r-v1.0
model_provider: cohere
inference_provider:
provider: bedrock
model_name: command-r-v1.0
endpoint: null
price:
per_input_token: 0.5
per_output_token: 1.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: An instruction-following conversational model that performs language tasks with high quality, more reliably and with a longer context than Cohere's base generative models.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama3-1-70b-instruct-v1.0
model_provider: meta
inference_provider:
provider: bedrock
model_name: llama3-1-70b-instruct-v1.0
endpoint: null
price:
per_input_token: 0.72
per_output_token: 0.72
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: Ideal for content creation, conversational AI, language understanding, research development, and enterprise applications. With new latency-optimized inference capabilities available in public preview, this model sets a new performance benchmark for AI solutions that process extensive text inputs, enabling applications to respond more quickly and handle longer queries more efficiently.
parameters: null
- model: llama3-1-8b-instruct-v1.0
model_provider: meta
inference_provider:
provider: bedrock
model_name: llama3-1-8b-instruct-v1.0
endpoint: null
price:
per_input_token: 0.22
per_output_token: 0.22
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: Ideal for limited computational power and resources, faster training times, and edge devices.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama3-2-1b-instruct-v1.0
model_provider: meta
inference_provider:
provider: bedrock
model_name: llama3-2-1b-instruct-v1.0
endpoint: null
price:
per_input_token: 0.1
per_output_token: 0.1
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: Text-only lightweight model built to deliver fast and accurate responses. Ideal for edge devices and mobile applications. The model enables on-device AI capabilities while preserving user privacy and minimizing latency.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama3-70b-instruct-v1.0
model_provider: meta
inference_provider:
provider: bedrock
model_name: llama3-70b-instruct-v1.0
endpoint: null
price:
per_input_token: 2.65
per_output_token: 3.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32000
description: Ideal for content creation, conversational AI, language understanding, research development, and enterprise applications.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama3-8b-instruct-v1.0
model_provider: meta
inference_provider:
provider: bedrock
model_name: llama3-8b-instruct-v1.0
endpoint: null
price:
per_input_token: 0.3
per_output_token: 0.6
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32000
description: Ideal for limited computational power and resources, and edge devices. The model excels at text summarization, text classification, sentiment analysis, and language translation.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mixtral-8x7b-instruct-v0.1
model_provider: mistral
inference_provider:
provider: bedrock
model_name: mixtral-8x7b-instruct-v0.1
endpoint: null
price:
per_input_token: 0.45
per_output_token: 0.7
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32000
description: An 8x7B sparse Mixture-of-Experts model with stronger capabilities than Mistral 7B. Uses about 12B active parameters out of about 45B total.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: grok-beta
model_provider: xai
inference_provider:
provider: xai
model_name: grok-beta
endpoint: https://api.x.ai/v1
price:
per_input_token: 5.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: Grok-beta is an experimental AI model developed by xAI, designed to provide insightful, witty, and context-aware responses while continuously learning and improving through user interactions.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: grok-vision-beta
model_provider: xai
inference_provider:
provider: xai
model_name: grok-vision-beta
endpoint: https://api.x.ai/v1
price:
per_input_token: 5.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 8192
description: Grok-vision-beta is a multimodal AI model developed by xAI that processes both text and image inputs to provide more comprehensive, context-aware responses.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: dbrx-instruct
model_provider: databricks
inference_provider:
provider: togetherai
model_name: databricks/dbrx-instruct
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.8
per_output_token: 0.8
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: DBRX Instruct is a model by Databricks, designed for instruction-following tasks and general language understanding.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-r1
model_provider: deepseek
inference_provider:
provider: fireworksai
model_name: accounts/fireworks/models/deepseek-r1
endpoint: https://api.fireworks.ai/inference/v1
price:
per_input_token: 8.0
per_output_token: 8.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 160000
description: DeepSeek-Reasoner is a reasoning-focused model designed for logical reasoning and problem solving across a range of domains.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: DeepSeek-R1
model_provider: deepseek
inference_provider:
provider: deepinfra
model_name: deepseek-ai/DeepSeek-R1
endpoint: https://api.deepinfra.com/v1/openai
price:
per_input_token: 0.75
per_output_token: 2.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 16000
description: DeepSeek-Reasoner is a reasoning-focused model designed for logical reasoning and problem solving across a range of domains.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: DeepSeek-R1-Distill-Llama-70B
model_provider: deepseek
inference_provider:
provider: deepinfra
model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
endpoint: https://api.deepinfra.com/v1/openai
price:
per_input_token: 0.23
per_output_token: 0.69
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: DeepSeek-R1-Distill-Llama-70B
model_provider: deepseek
inference_provider:
provider: togetherai
model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
endpoint: https://api.together.xyz/v1
price:
per_input_token: 2.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: DeepSeek-R1-Distill-Qwen-14B
model_provider: deepseek
inference_provider:
provider: togetherai
model_name: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
endpoint: https://api.together.xyz/v1
price:
per_input_token: 1.6
per_output_token: 1.6
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: DeepSeek-R1-Distill-Qwen-1.5B
model_provider: deepseek
inference_provider:
provider: togetherai
model_name: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.18
per_output_token: 0.18
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: DeepSeek-R1-Distill-Qwen-32B
model_provider: deepseek
inference_provider:
provider: deepinfra
model_name: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
endpoint: https://api.deepinfra.com/v1/openai
price:
per_input_token: 0.12
per_output_token: 0.18
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-v3
model_provider: deepseek
inference_provider:
provider: fireworksai
model_name: accounts/fireworks/models/deepseek-v3
endpoint: https://api.fireworks.ai/inference/v1
price:
per_input_token: 0.9
per_output_token: 0.9
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: A strong Mixture-of-Experts (MoE) language model from DeepSeek with 671B total parameters, of which 37B are activated for each token.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: DeepSeek-V3
model_provider: deepseek
inference_provider:
provider: deepinfra
model_name: deepseek-ai/DeepSeek-V3
endpoint: https://api.deepinfra.com/v1/openai
price:
per_input_token: 0.49
per_output_token: 0.89
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 16000
description: A strong Mixture-of-Experts (MoE) language model from DeepSeek with 671B total parameters, of which 37B are activated for each token.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: DeepSeek-V3
model_provider: deepseek
inference_provider:
provider: togetherai
model_name: deepseek-ai/DeepSeek-V3
endpoint: https://api.together.xyz/v1
price:
per_input_token: 1.25
per_output_token: 1.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: A strong Mixture-of-Experts (MoE) language model from DeepSeek with 671B total parameters, of which 37B are activated for each token.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemma-2-27b-it
model_provider: google
inference_provider:
provider: togetherai
model_name: google/gemma-2-27b-it
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.3
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: Gemma 2 offers best-in-class performance, runs at incredible speed across different hardware and easily integrates with other AI tools.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemma-2-9b-it
model_provider: google
inference_provider:
provider: togetherai
model_name: google/gemma-2-9b-it
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.3
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: Gemma 2 offers best-in-class performance, runs at incredible speed across different hardware and easily integrates with other AI tools.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: Llama-2-13b-chat-hf
model_provider: meta
inference_provider:
provider: togetherai
model_name: meta-llama/Llama-2-13b-chat-hf
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.3
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: LLaMA-2 Chat (13B) is Meta's conversational AI model, designed for engaging and coherent dialogue.
parameters: null
- model: Llama-3.1-Nemotron-70B-Instruct-HF
model_provider: nvidia
inference_provider:
provider: togetherai
model_name: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.9
per_output_token: 0.9
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses.
parameters: null
- model: llama-v3p1-405b-instruct
model_provider: meta
inference_provider:
provider: fireworksai
model_name: accounts/fireworks/models/llama-v3p1-405b-instruct
endpoint: https://api.fireworks.ai/inference/v1
price:
per_input_token: 3.0
per_output_token: 3.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: Llama 3.1 405B Instruct is the largest model in Meta's Llama 3.1 family, an instruction-tuned, text-only model suited to demanding reasoning, coding, and multilingual tasks.
parameters: null
- model: MythoMax-L2-13b
model_provider: gryphe
inference_provider:
provider: togetherai
model_name: Gryphe/MythoMax-L2-13b
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.3
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: MythoMax-L2 (13B) is a model by Gryphe, known for its creative text generation capabilities.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: Nous-Hermes-2-Mixtral-8x7B-DPO
model_provider: nousresearch
inference_provider:
provider: togetherai
model_name: NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.9
per_output_token: 0.9
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: Nous Hermes 2 - Mixtral 8x7B-DPO (46.7B) is a NousResearch fine-tune of the Mixtral 8x7B Mixture-of-Experts model, further trained with Direct Preference Optimization (DPO).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: phi-4
model_provider: microsoft
inference_provider:
provider: deepinfra
model_name: microsoft/phi-4
endpoint: https://api.deepinfra.com/v1/openai
price:
per_input_token: 0.07
per_output_token: 0.14
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: Phi-4 is a model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Values above 1.0 (up to 2.0) make the model less likely to repeat tokens, values below 1.0 encourage token reuse, and 1.0 applies no penalty.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: Qwen2.5-72B-Instruct-Turbo
model_provider: qwen
inference_provider:
provider: togetherai
model_name: Qwen/Qwen2.5-72B-Instruct-Turbo
endpoint: https://api.together.xyz/v1
price:
per_input_token: 1.2
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: Qwen 2.5 72B Instruct is a large-scale model by Qwen, offering advanced capabilities for complex language tasks.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: Qwen2.5-7B-Instruct-Turbo
model_provider: qwen
inference_provider:
provider: togetherai
model_name: Qwen/Qwen2.5-7B-Instruct-Turbo
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.3
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: Qwen 2.5 7B Instruct is a fast and efficient model by Qwen, optimized for quick responses in instruction-following tasks.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: Qwen2.5-Coder-32B-Instruct
model_provider: qwen
inference_provider:
provider: togetherai
model_name: Qwen/Qwen2.5-Coder-32B-Instruct
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.8
per_output_token: 0.8
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: Qwen2-72B-Instruct
model_provider: qwen
inference_provider:
provider: togetherai
model_name: Qwen/Qwen2-72B-Instruct
endpoint: https://api.together.xyz/v1
price:
per_input_token: 1.2
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: Qwen 2 Instruct (72B) is a large-scale model by Qwen, offering advanced capabilities for complex language tasks.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen2p5-coder-32b-instruct
model_provider: qwen
inference_provider:
provider: fireworksai
model_name: accounts/fireworks/models/qwen2p5-coder-32b-instruct
endpoint: https://api.fireworks.ai/inference/v1
price:
per_input_token: 0.9
per_output_token: 0.9
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: Qwen/Qwen2.5-Coder-32B-Instruct
model_provider: qwen
inference_provider:
provider: deepinfra
model_name: Qwen/Qwen2.5-Coder-32B-Instruct
endpoint: https://api.deepinfra.com/v1/openai
price:
per_input_token: 0.07
per_output_token: 0.16
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: QwQ-32B-Preview
model_provider: qwen
inference_provider:
provider: togetherai
model_name: Qwen/QwQ-32B-Preview
endpoint: https://api.together.xyz/v1
price:
per_input_token: 1.2
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: SOLAR-10.7B-Instruct-v1.0
model_provider: upstage
inference_provider:
provider: togetherai
model_name: upstage/SOLAR-10.7B-Instruct-v1.0
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.3
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: Upstage SOLAR Instruct v1 (11B) is a versatile model by Upstage, focused on following instructions across various domains.
parameters: null
- model: WizardLM-2-8x22B
model_provider: microsoft
inference_provider:
provider: togetherai
model_name: microsoft/WizardLM-2-8x22B
endpoint: https://api.together.xyz/v1
price:
per_input_token: 0.3
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 65536
description: WizardLM-2 8x22B is a large language model developed by Microsoft, known for its advanced capabilities in natural language understanding and generation.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: aion-1.0
model_provider: aion-labs
inference_provider:
provider: openrouter
model_name: aion-labs/aion-1.0
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 8.0
per_output_token: 24.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: Aion-1.0 is a multi-model system designed for high performance across various tasks, including reasoning and coding. It is built on DeepSeek-R1, augmented with additional models and techniques such as Tree of Thoughts (ToT) and Mixture of Experts (MoE). It is Aion Labs' most powerful reasoning model.
parameters:
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: aion-1.0-mini
model_provider: aion-labs
inference_provider:
provider: openrouter
model_name: aion-labs/aion-1.0-mini
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 2.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: Aion-1.0-Mini is a 32B-parameter model distilled from DeepSeek-R1, designed for strong performance in reasoning domains such as mathematics, coding, and logic. It is a modified variant of a FuseAI model that outperforms R1-Distill-Qwen-32B and R1-Distill-Llama-70B, with benchmark results available on its [Hugging Face page](https://huggingface.co/FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview), independently replicated for verification.
parameters:
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: aion-rp-llama-3.1-8b
model_provider: aion-labs
inference_provider:
provider: openrouter
model_name: aion-labs/aion-rp-llama-3.1-8b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 0.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: Aion-RP-Llama-3.1-8B ranks the highest in the character evaluation portion of the RPBench-Auto benchmark, a roleplaying-specific variant of Arena-Hard-Auto, where LLMs evaluate each other’s responses. It is a fine-tuned base model rather than an instruct model, designed to produce more natural and varied writing.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: airoboros-l2-70b
model_provider: jondurbin
inference_provider:
provider: openrouter
model_name: jondurbin/airoboros-l2-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 0.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4000
description: |-
A Llama 2 70B fine-tune using synthetic data (the Airoboros dataset).
Currently based on [jondurbin/airoboros-l2-70b-2.2.1](https://huggingface.co/jondurbin/airoboros-l2-70b-2.2.1), but might get updated in the future.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: chatgpt-4o-latest
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/chatgpt-4o-latest
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 5.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: |-
OpenAI ChatGPT 4o is continually updated by OpenAI to point to the current version of GPT-4o used by ChatGPT. It therefore differs slightly from the API version of [GPT-4o](/models/openai/gpt-4o) in that it has additional RLHF. It is intended for research and evaluation.
OpenAI notes that this model is not suited for production use cases, as it may be removed or redirected to another model in the future.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-2
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-2
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 8.0
per_output_token: 24.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 200000
description: 'Claude 2 delivers advancements in key capabilities for enterprises—including an industry-leading 200K token context window, significant reductions in rates of model hallucination, system prompts and a new beta feature: tool use.'
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-2.0
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-2.0
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 8.0
per_output_token: 24.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 100000
description: Anthropic's flagship model. Superior performance on tasks that require complex reasoning. Supports hundreds of pages of text.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-2.0:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-2.0:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 8.0
per_output_token: 24.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 100000
description: Anthropic's flagship model. Superior performance on tasks that require complex reasoning. Supports hundreds of pages of text.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-2.1
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-2.1
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 8.0
per_output_token: 24.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 200000
description: 'Claude 2 delivers advancements in key capabilities for enterprises—including an industry-leading 200K token context window, significant reductions in rates of model hallucination, system prompts and a new beta feature: tool use.'
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-2.1:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-2.1:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 8.0
per_output_token: 24.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 200000
description: 'Claude 2 delivers advancements in key capabilities for enterprises—including an industry-leading 200K token context window, significant reductions in rates of model hallucination, system prompts and a new beta feature: tool use.'
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-2:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-2:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 8.0
per_output_token: 24.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 200000
description: 'Claude 2 delivers advancements in key capabilities for enterprises—including an industry-leading 200K token context window, significant reductions in rates of model hallucination, system prompts and a new beta feature: tool use.'
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3.5-haiku
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3.5-haiku
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 4.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3.5 Haiku offers enhanced capabilities in speed, coding accuracy, and tool use. Engineered to excel in real-time applications, it delivers quick response times that are essential for dynamic tasks such as chat interactions and immediate coding suggestions.
This makes it highly suitable for environments that demand both speed and precision, such as software development, customer service bots, and data management systems.
This model is currently pointing to [Claude 3.5 Haiku (2024-10-22)](/anthropic/claude-3-5-haiku-20241022).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3.5-haiku-20241022
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3.5-haiku-20241022
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 4.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3.5 Haiku features enhancements across all skill sets including coding, tool use, and reasoning. As the fastest model in the Anthropic lineup, it offers rapid response times suitable for applications that require high interactivity and low latency, such as user-facing chatbots and on-the-fly code completions. It also excels in specialized tasks like data extraction and real-time content moderation, making it a versatile tool for a broad range of industries.
It does not support image inputs.
See the launch announcement and benchmark results [here](https://www.anthropic.com/news/3-5-models-and-computer-use)
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3.5-haiku-20241022:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3.5-haiku-20241022:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 4.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3.5 Haiku features enhancements across all skill sets including coding, tool use, and reasoning. As the fastest model in the Anthropic lineup, it offers rapid response times suitable for applications that require high interactivity and low latency, such as user-facing chatbots and on-the-fly code completions. It also excels in specialized tasks like data extraction and real-time content moderation, making it a versatile tool for a broad range of industries.
It does not support image inputs.
See the launch announcement and benchmark results [here](https://www.anthropic.com/news/3-5-models-and-computer-use)
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3.5-haiku:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3.5-haiku:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 4.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3.5 Haiku offers enhanced capabilities in speed, coding accuracy, and tool use. Engineered to excel in real-time applications, it delivers quick response times that are essential for dynamic tasks such as chat interactions and immediate coding suggestions.
This makes it highly suitable for environments that demand both speed and precision, such as software development, customer service bots, and data management systems.
This model is currently pointing to [Claude 3.5 Haiku (2024-10-22)](/anthropic/claude-3-5-haiku-20241022).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3.5-sonnet
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3.5-sonnet
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
New Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at:
- Coding: Scores ~49% on SWE-Bench Verified, higher than the last best score, and without any fancy prompt scaffolding
- Data science: Augments human data science expertise; navigates unstructured data while using multiple tools for insights
- Visual processing: Excels at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone
- Agentic tasks: Exceptional tool use, making it great at agentic tasks (i.e., complex, multi-step problem-solving tasks that require engaging with other systems)
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3.5-sonnet-20240620
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3.5-sonnet-20240620
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at:
- Coding: Autonomously writes, edits, and runs code with reasoning and troubleshooting
- Data science: Augments human data science expertise; navigates unstructured data while using multiple tools for insights
- Visual processing: Excels at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone
- Agentic tasks: Exceptional tool use, making it great at agentic tasks (i.e., complex, multi-step problem-solving tasks that require engaging with other systems)
For the latest version (2024-10-23), check out [Claude 3.5 Sonnet](/anthropic/claude-3.5-sonnet).
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3.5-sonnet-20240620:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3.5-sonnet-20240620:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at:
- Coding: Autonomously writes, edits, and runs code with reasoning and troubleshooting
- Data science: Augments human data science expertise; navigates unstructured data while using multiple tools for insights
- Visual processing: Excels at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone
- Agentic tasks: Exceptional tool use, making it well suited to complex, multi-step problem-solving tasks that require engaging with other systems
For the latest version (2024-10-23), check out [Claude 3.5 Sonnet](/anthropic/claude-3.5-sonnet).
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3.5-sonnet:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3.5-sonnet:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
New Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at:
- Coding: Scores ~49% on SWE-Bench Verified, higher than the previous best score, without any elaborate prompt scaffolding
- Data science: Augments human data science expertise; navigates unstructured data while using multiple tools for insights
- Visual processing: Excels at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone
- Agentic tasks: Exceptional tool use, making it well suited to complex, multi-step problem-solving tasks that require engaging with other systems
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3-haiku
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3-haiku
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.25
per_output_token: 1.25
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3 Haiku is Anthropic's fastest and most compact model for
near-instant responsiveness. Quick and accurate targeted performance.
See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-haiku)
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3-haiku:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3-haiku:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.25
per_output_token: 1.25
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3 Haiku is Anthropic's fastest and most compact model for
near-instant responsiveness. Quick and accurate targeted performance.
See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-haiku)
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3-opus
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3-opus
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 15.0
per_output_token: 75.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3 Opus is Anthropic's most powerful model for highly complex tasks. It boasts top-level performance, intelligence, fluency, and understanding.
See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-family)
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3-opus:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3-opus:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 15.0
per_output_token: 75.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3 Opus is Anthropic's most powerful model for highly complex tasks. It boasts top-level performance, intelligence, fluency, and understanding.
See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-family)
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3-sonnet
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3-sonnet
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3 Sonnet is an ideal balance of intelligence and speed for enterprise workloads. Maximum utility at a lower price, dependable, balanced for scaled deployments.
See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-family)
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: claude-3-sonnet:beta
model_provider: anthropic
inference_provider:
provider: openrouter
model_name: anthropic/claude-3-sonnet:beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
Claude 3 Sonnet is an ideal balance of intelligence and speed for enterprise workloads. Maximum utility at a lower price, dependable, balanced for scaled deployments.
See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-family)
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: codestral-2501
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/codestral-2501
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.3
per_output_token: 0.9
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 256000
description: "[Mistral](/mistralai)'s cutting-edge language model for coding. Codestral specializes in low-latency, high-frequency tasks such as fill-in-the-middle (FIM), code correction, and test generation.\n\nLearn more in their blog post: https://mistral.ai/news/codestral-2501/"
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: codestral-mamba
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/codestral-mamba
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.25
per_output_token: 0.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 256000
description: |-
A 7.3B parameter Mamba-based model designed for code and reasoning tasks.
- Linear time inference, allowing for theoretically infinite sequence lengths
- 256k token context window
- Optimized for quick responses, especially beneficial for code productivity
- Performs comparably to state-of-the-art transformer models in code and reasoning tasks
- Available under the Apache 2.0 license for free use, modification, and distribution
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: command
model_provider: cohere
inference_provider:
provider: openrouter
model_name: cohere/command
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.95
per_output_token: 1.9
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
Command is an instruction-following conversational model that performs language tasks with high quality, more reliably and with a longer context than our base generative models.
Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: command-r
model_provider: cohere
inference_provider:
provider: openrouter
model_name: cohere/command-r
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.475
per_output_token: 1.425
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
Command-R is a 35B parameter model that performs conversational language tasks at a higher quality, more reliably, and with a longer context than previous models. It can be used for complex workflows like code generation, retrieval augmented generation (RAG), tool use, and agents.
Read the launch post [here](https://txt.cohere.com/command-r/).
Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: command-r-03-2024
model_provider: cohere
inference_provider:
provider: openrouter
model_name: cohere/command-r-03-2024
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.475
per_output_token: 1.425
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
Command-R is a 35B parameter model that performs conversational language tasks at a higher quality, more reliably, and with a longer context than previous models. It can be used for complex workflows like code generation, retrieval augmented generation (RAG), tool use, and agents.
Read the launch post [here](https://txt.cohere.com/command-r/).
Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: command-r-08-2024
model_provider: cohere
inference_provider:
provider: openrouter
model_name: cohere/command-r-08-2024
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1425
per_output_token: 0.57
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
command-r-08-2024 is an update of [Command R](/models/cohere/command-r) with improved performance for multilingual retrieval-augmented generation (RAG) and tool use. More broadly, it is better at math, code, and reasoning, and is competitive with the previous version of the larger Command R+ model.
Read the launch post [here](https://docs.cohere.com/changelog/command-gets-refreshed).
Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: command-r7b-12-2024
model_provider: cohere
inference_provider:
provider: openrouter
model_name: cohere/command-r7b-12-2024
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0375
per_output_token: 0.15
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: Command R7B (12-2024) is a small, fast update of the Command R+ model, delivered in December 2024. It excels at RAG, tool use, agents, and similar tasks requiring complex reasoning and multiple steps.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: command-r-plus
model_provider: cohere
inference_provider:
provider: openrouter
model_name: cohere/command-r-plus
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.85
per_output_token: 14.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
Command R+ is a new, 104B-parameter LLM from Cohere. It's useful for roleplay, general consumer use cases, and Retrieval Augmented Generation (RAG).
It offers multilingual support for ten key languages to facilitate global business operations. See benchmarks and the launch post [here](https://txt.cohere.com/command-r-plus-microsoft-azure/).
Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: command-r-plus-04-2024
model_provider: cohere
inference_provider:
provider: openrouter
model_name: cohere/command-r-plus-04-2024
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.85
per_output_token: 14.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
Command R+ is a new, 104B-parameter LLM from Cohere. It's useful for roleplay, general consumer use cases, and Retrieval Augmented Generation (RAG).
It offers multilingual support for ten key languages to facilitate global business operations. See benchmarks and the launch post [here](https://txt.cohere.com/command-r-plus-microsoft-azure/).
Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: command-r-plus-08-2024
model_provider: cohere
inference_provider:
provider: openrouter
model_name: cohere/command-r-plus-08-2024
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.375
per_output_token: 9.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
command-r-plus-08-2024 is an update of [Command R+](/models/cohere/command-r-plus) with roughly 50% higher throughput and 25% lower latency than the previous Command R+ version, while keeping the hardware footprint the same.
Read the launch post [here](https://docs.cohere.com/changelog/command-gets-refreshed).
Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: dbrx-instruct
model_provider: databricks
inference_provider:
provider: openrouter
model_name: databricks/dbrx-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.2
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
DBRX is a new open-source large language model developed by Databricks. At 132B total parameters, it outperforms existing open-source LLMs like Llama 2 70B and [Mixtral-8x7b](/models/mistralai/mixtral-8x7b) on standard industry benchmarks for language understanding, programming, math, and logic.
It uses a fine-grained mixture-of-experts (MoE) architecture. 36B parameters are active on any input. It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts.
See the launch announcement and benchmark results [here](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-chat
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-chat
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.49
per_output_token: 0.89
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16000
description: |-
DeepSeek-V3 is the latest model from the DeepSeek team, building upon the instruction following and coding abilities of the previous versions. The model was pre-trained on nearly 15 trillion tokens, and the reported evaluations show that it outperforms other open-source models and rivals leading closed-source models.
For model details, visit [the DeepSeek-V3 repo](https://github.com/deepseek-ai/DeepSeek-V3) or see the [launch announcement](https://api-docs.deepseek.com/news/news1226).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-chat-v2.5
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-chat-v2.5
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct. The new model integrates the general and coding abilities of the two previous versions. For model details, see the [DeepSeek-V2 page](https://github.com/deepseek-ai/DeepSeek-V2).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-r1
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-r1
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 8.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 163840
description: |-
DeepSeek R1 is here: Performance on par with [OpenAI o1](/openai/o1), but open-sourced and with fully open reasoning tokens. It's 671B parameters in size, with 37B active in an inference pass.
Fully open-source model & [technical report](https://api-docs.deepseek.com/news/news250120).
MIT licensed: Distill & commercialize freely!
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-r1-distill-llama-70b
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-r1-distill-llama-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.23
per_output_token: 0.69
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
DeepSeek R1 Distill Llama 70B is a distilled large language model based on [Llama-3.3-70B-Instruct](/meta-llama/llama-3.3-70b-instruct), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). The model uses advanced distillation techniques to achieve high performance across multiple benchmarks, including:
- AIME 2024 pass@1: 70.0
- MATH-500 pass@1: 94.5
- CodeForces Rating: 1633
The model is fine-tuned on DeepSeek R1's outputs, enabling performance comparable to larger frontier models.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-r1-distill-llama-70b:free
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-r1-distill-llama-70b:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
DeepSeek R1 Distill Llama 70B is a distilled large language model based on [Llama-3.3-70B-Instruct](/meta-llama/llama-3.3-70b-instruct), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). The model uses advanced distillation techniques to achieve high performance across multiple benchmarks, including:
- AIME 2024 pass@1: 70.0
- MATH-500 pass@1: 94.5
- CodeForces Rating: 1633
The model is fine-tuned on DeepSeek R1's outputs, enabling performance comparable to larger frontier models.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-r1-distill-qwen-14b
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-r1-distill-qwen-14b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.6
per_output_token: 1.6
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
DeepSeek R1 Distill Qwen 14B is a distilled large language model based on [Qwen 2.5 14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). It outperforms OpenAI's o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.
Other benchmark results include:
- AIME 2024 pass@1: 69.7
- MATH-500 pass@1: 93.9
- CodeForces Rating: 1481
The model is fine-tuned on DeepSeek R1's outputs, enabling performance comparable to larger frontier models.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-r1-distill-qwen-1.5b
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-r1-distill-qwen-1.5b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.18
per_output_token: 0.18
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
DeepSeek R1 Distill Qwen 1.5B is a distilled large language model based on [Qwen 2.5 Math 1.5B](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). It is a very small, efficient model that outperforms [GPT-4o 0513](/openai/gpt-4o-2024-05-13) on math benchmarks.
Other benchmark results include:
- AIME 2024 pass@1: 28.9
- AIME 2024 cons@64: 52.7
- MATH-500 pass@1: 83.9
The model is fine-tuned on DeepSeek R1's outputs, enabling performance comparable to larger frontier models.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-r1-distill-qwen-32b
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-r1-distill-qwen-32b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.12
per_output_token: 0.18
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
DeepSeek R1 Distill Qwen 32B is a distilled large language model based on [Qwen 2.5 32B](https://huggingface.co/Qwen/Qwen2.5-32B), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). It outperforms OpenAI's o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.
Other benchmark results include:
- AIME 2024 pass@1: 72.6
- MATH-500 pass@1: 94.3
- CodeForces Rating: 1691
The model is fine-tuned on DeepSeek R1's outputs, enabling performance comparable to larger frontier models.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: deepseek-r1:free
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-r1:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 163840
description: |-
DeepSeek R1 is here: Performance on par with [OpenAI o1](/openai/o1), but open-sourced and with fully open reasoning tokens. It's 671B parameters in size, with 37B active in an inference pass.
Fully open-source model & [technical report](https://api-docs.deepseek.com/news/news250120).
MIT licensed: Distill & commercialize freely!
parameters:
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
- model: deepseek-r1:nitro
model_provider: deepseek
inference_provider:
provider: openrouter
model_name: deepseek/deepseek-r1:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 8.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 163840
description: |-
DeepSeek R1 is here: Performance on par with [OpenAI o1](/openai/o1), but open-sourced and with fully open reasoning tokens. It's 671B parameters in size, with 37B active in an inference pass.
Fully open-source model & [technical report](https://api-docs.deepseek.com/news/news250120).
MIT licensed: Distill & commercialize freely!
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: dolphin-mixtral-8x7b
model_provider: cognitivecomputations
inference_provider:
provider: openrouter
model_name: cognitivecomputations/dolphin-mixtral-8x7b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 0.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
This is a 16k context fine-tune of [Mixtral-8x7b](/models/mistralai/mixtral-8x7b). It excels in coding tasks due to extensive training with coding data and is known for its obedience, although it lacks DPO tuning.
The model is uncensored and is stripped of alignment and bias. It requires an external alignment layer for ethical use. Users are cautioned to use this highly compliant model responsibly, as detailed in a blog post about uncensored models at [erichartford.com/uncensored-models](https://erichartford.com/uncensored-models).
#moe #uncensored
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: eva-llama-3.33-70b
model_provider: eva-unit-01
inference_provider:
provider: openrouter
model_name: eva-unit-01/eva-llama-3.33-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 4.0
per_output_token: 6.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |
EVA Llama 3.33 70b is a roleplay and storywriting specialist model. It is a full-parameter finetune of [Llama-3.3-70B-Instruct](https://openrouter.ai/meta-llama/llama-3.3-70b-instruct) on a mixture of synthetic and natural data.
It uses the Celeste 70B 0.1 data mixture, greatly expanding it to improve the versatility, creativity, and "flavor" of the resulting model.
This model was built with Llama by Meta.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: eva-qwen-2.5-32b
model_provider: eva-unit-01
inference_provider:
provider: openrouter
model_name: eva-unit-01/eva-qwen-2.5-32b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.6
per_output_token: 3.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |-
EVA Qwen2.5 32B is a roleplaying/storywriting specialist model. It's a full-parameter finetune of Qwen2.5-32B on a mixture of synthetic and natural data.
It uses the Celeste 70B 0.1 data mixture, greatly expanding it to improve the versatility, creativity, and "flavor" of the resulting model.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: eva-qwen-2.5-72b
model_provider: eva-unit-01
inference_provider:
provider: openrouter
model_name: eva-unit-01/eva-qwen-2.5-72b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 4.0
per_output_token: 6.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |-
EVA Qwen2.5 72B is a roleplay and storywriting specialist model. It's a full-parameter finetune of Qwen2.5-72B on a mixture of synthetic and natural data.
It uses the Celeste 70B 0.1 data mixture, greatly expanding it to improve the versatility, creativity, and "flavor" of the resulting model.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: fimbulvetr-11b-v2
model_provider: sao10k
inference_provider:
provider: openrouter
model_name: sao10k/fimbulvetr-11b-v2
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
Creative writing model, routed with permission. It's fast, it keeps the conversation going, and it stays in character.
If you submit a raw prompt, you can use Alpaca or Vicuna formats.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-2.0-flash-001
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-2.0-flash-001
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1
per_output_token: 0.4
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1000000
description: Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5). It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-2.0-flash-exp:free
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-2.0-flash-exp:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1048576
description: Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](google/gemini-pro-1.5). It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-2.0-flash-lite-preview-02-05:free
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-2.0-flash-lite-preview-02-05:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1000000
description: Gemini Flash Lite 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](google/gemini-pro-1.5). Because it's currently in preview, it will be **heavily rate-limited** by Google. This model will move from free to paid pending a general rollout on February 24th, at $0.075 / $0.30 per million input / output tokens respectively.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-2.0-flash-thinking-exp-1219:free
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-2.0-flash-thinking-exp-1219:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 40000
description: Gemini 2.0 Flash Thinking Mode is an experimental model that's trained to generate the "thinking process" the model goes through as part of its response. As a result, Thinking Mode offers stronger reasoning in its responses than the [base Gemini 2.0 Flash model](/google/gemini-2.0-flash-exp).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-2.0-flash-thinking-exp:free
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-2.0-flash-thinking-exp:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 1048576
description: |-
Gemini 2.0 Flash Thinking Experimental (01-21) is a snapshot of Gemini 2.0 Flash Thinking Experimental.
Gemini 2.0 Flash Thinking Mode is an experimental model that's trained to generate the "thinking process" the model goes through as part of its response. As a result, Thinking Mode offers stronger reasoning in its responses than the [base Gemini 2.0 Flash model](/google/gemini-2.0-flash-exp).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-2.0-pro-exp-02-05:free
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-2.0-pro-exp-02-05:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 2000000
description: |-
Gemini 2.0 Pro Experimental is a bleeding-edge version of the Gemini 2.0 Pro model. Because it's currently experimental, it will be **heavily rate-limited** by Google.
Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).
#multimodal
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-exp-1206:free
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-exp-1206:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 2097152
description: Experimental release (December 6, 2024) of Gemini.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-flash-1.5
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-flash-1.5
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.075
per_output_token: 0.3
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1000000
description: |-
Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots.
Gemini 1.5 Flash is designed for high-volume, high-frequency tasks where cost and latency matter. On most common tasks, Flash achieves comparable quality to other Gemini Pro models at a significantly reduced cost. Flash is well-suited for applications like chat assistants and on-demand content generation where speed and scale matter.
Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).
#multimodal
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-flash-1.5-8b
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-flash-1.5-8b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0375
per_output_token: 0.15
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1000000
description: |-
Gemini Flash 1.5 8B is optimized for speed and efficiency, offering enhanced performance in small prompt tasks like chat, transcription, and translation. With reduced latency, it is highly effective for real-time and large-scale operations. This model focuses on cost-effective solutions while maintaining high-quality results.
[Click here to learn more about this model](https://developers.googleblog.com/en/gemini-15-flash-8b-is-now-generally-available-for-use/).
Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-flash-1.5-8b-exp
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-flash-1.5-8b-exp
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1000000
description: |-
Gemini Flash 1.5 8B Experimental is an experimental, 8B parameter version of the [Gemini Flash 1.5](/models/google/gemini-flash-1.5) model.
Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).
#multimodal
Note: This model is currently experimental and not suitable for production use cases, and may be heavily rate-limited.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-pro
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-pro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 1.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32760
description: |-
Google's flagship text generation model. Designed to handle natural language tasks, multiturn text and code chat, and code generation.
See the benchmarks and prompting guidelines from [DeepMind](https://deepmind.google/technologies/gemini/).
Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-pro-1.5
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-pro-1.5
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.25
per_output_token: 5.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 2000000
description: |-
Google's latest multimodal model, supporting image and video[0] in text or chat prompts.
Optimized for language tasks including:
- Code generation
- Text generation
- Text editing
- Problem solving
- Recommendations
- Information extraction
- Data extraction or generation
- AI agents
Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).
* [0]: Video input is not available through OpenRouter at this time.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemini-pro-vision
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemini-pro-vision
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 1.5
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |-
Google's flagship multimodal model, supporting image and video in text or chat prompts for a text or code response.
See the benchmarks and prompting guidelines from [DeepMind](https://deepmind.google/technologies/gemini/).
Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).
#multimodal
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemma-2-27b-it
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemma-2-27b-it
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.27
per_output_token: 0.27
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Gemma 2 27B by Google is an open model built from the same research and technology used to create the [Gemini models](https://openrouter.ai/models?q=gemini).
Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning.
See the [launch announcement](https://blog.google/technology/developers/google-gemma-2/) for more details. Usage of Gemma is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemma-2-9b-it
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemma-2-9b-it
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.03
per_output_token: 0.06
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Gemma 2 9B by Google is an advanced, open-source language model that sets a new standard for efficiency and performance in its size class.
Designed for a wide variety of tasks, it empowers developers and researchers to build innovative applications, while maintaining accessibility, safety, and cost-effectiveness.
See the [launch announcement](https://blog.google/technology/developers/google-gemma-2/) for more details. Usage of Gemma is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemma-2-9b-it:free
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemma-2-9b-it:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Gemma 2 9B by Google is an advanced, open-source language model that sets a new standard for efficiency and performance in its size class.
Designed for a wide variety of tasks, it empowers developers and researchers to build innovative applications, while maintaining accessibility, safety, and cost-effectiveness.
See the [launch announcement](https://blog.google/technology/developers/google-gemma-2/) for more details. Usage of Gemma is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gemma-7b-it
model_provider: google
inference_provider:
provider: openrouter
model_name: google/gemma-7b-it
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.15
per_output_token: 0.15
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 8192
description: |-
Gemma by Google is an advanced, open-source language model family, leveraging the latest in decoder-only, text-to-text technology. It offers English language capabilities across text generation tasks like question answering, summarization, and reasoning. The Gemma 7B variant is comparable in performance to leading open source models.
Usage of Gemma is subject to Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: goliath-120b
model_provider: alpindale
inference_provider:
provider: openrouter
model_name: alpindale/goliath-120b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 9.375
per_output_token: 9.375
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 6144
description: |-
A large LLM created by combining two fine-tuned Llama 70B models into one 120B model. Combines Xwin and Euryale.
Credits to
- [@chargoddard](https://huggingface.co/chargoddard) for developing the framework used to merge the model - [mergekit](https://github.com/cg123/mergekit).
- [@Undi95](https://huggingface.co/Undi95) for helping with the merge ratios.
#merge
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-3.5-turbo
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-3.5-turbo
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 1.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 16385
description: |-
GPT-3.5 Turbo is OpenAI's fastest model. It can understand and generate natural language or code, and is optimized for chat and traditional completion tasks.
Training data up to Sep 2021.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-3.5-turbo-0125
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-3.5-turbo-0125
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 1.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 16385
description: |-
The latest GPT-3.5 Turbo model with improved instruction following, JSON mode, reproducible outputs, parallel function calling, and more. Training data: up to Sep 2021.
This version has a higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-3.5-turbo-0613
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-3.5-turbo-0613
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 4095
description: |-
GPT-3.5 Turbo is OpenAI's fastest model. It can understand and generate natural language or code, and is optimized for chat and traditional completion tasks.
Training data up to Sep 2021.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools the model may call during generation, defined according to the tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-3.5-turbo-1106
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-3.5-turbo-1106
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 16385
description: 'An older GPT-3.5 Turbo model with improved instruction following, JSON mode, reproducible outputs, parallel function calling, and more. Training data: up to Sep 2021.'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens. If true, the response includes the log probability of each output token.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools the model may call during generation, defined according to the tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-3.5-turbo-16k
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-3.5-turbo-16k
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 4.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 16385
description: 'This model offers four times the context length of gpt-3.5-turbo, allowing it to support approximately 20 pages of text in a single request at a higher cost. Training data: up to Sep 2021.'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens. If true, the response includes the log probability of each output token.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools the model may call during generation, defined according to the tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-3.5-turbo-instruct
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-3.5-turbo-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.5
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4095
description: 'This model is a variant of GPT-3.5 Turbo tuned for instructional prompts and omitting chat-related optimizations. Training data: up to Sep 2021.'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens. If true, the response includes the log probability of each output token.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 30.0
per_output_token: 60.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 8191
description: 'OpenAI''s flagship model, GPT-4 is a large-scale multimodal language model capable of solving difficult problems with greater accuracy than previous models due to its broader general knowledge and advanced reasoning capabilities. Training data: up to Sep 2021.'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens. If true, the response includes the log probability of each output token.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools the model may call during generation, defined according to the tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4-0314
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4-0314
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 30.0
per_output_token: 60.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 8191
description: 'GPT-4-0314 is the first version of GPT-4 released, with a context length of 8,192 tokens, and was supported until June 14. Training data: up to Sep 2021.'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens. If true, the response includes the log probability of each output token.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools the model may call during generation, defined according to the tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4-1106-preview
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4-1106-preview
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 10.0
per_output_token: 30.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
The latest GPT-4 Turbo model with vision capabilities. Vision requests can now use JSON mode and function calling.
Training data: up to April 2023.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens. If true, the response includes the log probability of each output token.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools the model may call during generation, defined according to the tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4-32k
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4-32k
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 60.0
per_output_token: 120.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32767
description: 'GPT-4-32k is an extended version of GPT-4, with the same capabilities but quadrupled context length, allowing for processing up to 40 pages of text in a single pass. This is particularly beneficial for handling longer content like interacting with PDFs without an external vector database. Training data: up to Sep 2021.'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens. If true, the response includes the log probability of each output token.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools the model may call during generation, defined according to the tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4-32k-0314
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4-32k-0314
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 60.0
per_output_token: 120.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32767
description: 'GPT-4-32k is an extended version of GPT-4, with the same capabilities but quadrupled context length, allowing for processing up to 40 pages of text in a single pass. This is particularly beneficial for handling longer content like interacting with PDFs without an external vector database. Training data: up to Sep 2021.'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens. If true, the response includes the log probability of each output token.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4o
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4o
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.5
per_output_token: 10.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.
For benchmarking against other models, it was briefly called ["im-also-a-good-gpt2-chatbot"](https://twitter.com/LiamFedus/status/1790064963966370209).
#multimodal
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4o-2024-05-13
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4o-2024-05-13
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 5.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.
For benchmarking against other models, it was briefly called ["im-also-a-good-gpt2-chatbot"](https://twitter.com/LiamFedus/status/1790064963966370209).
#multimodal
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4o-2024-08-06
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4o-2024-08-06
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.5
per_output_token: 10.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
The 2024-08-06 version of GPT-4o offers improved performance in structured outputs, with the ability to supply a JSON schema in the response_format. Read more [here](https://openai.com/index/introducing-structured-outputs-in-the-api/).
GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.
For benchmarking against other models, it was briefly called ["im-also-a-good-gpt2-chatbot"](https://twitter.com/LiamFedus/status/1790064963966370209).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4o-2024-11-20
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4o-2024-11-20
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.5
per_output_token: 10.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
The 2024-11-20 version of GPT-4o offers a leveled-up creative writing ability with more natural, engaging, and tailored writing to improve relevance & readability. It’s also better at working with uploaded files, providing deeper insights & more thorough responses.
GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4o:extended
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4o:extended
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 6.0
per_output_token: 18.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.
For benchmarking against other models, it was briefly called ["im-also-a-good-gpt2-chatbot"](https://twitter.com/LiamFedus/status/1790064963966370209).
#multimodal
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4o-mini
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4o-mini
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.15
per_output_token: 0.6
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting both text and image inputs with text outputs.
As their most advanced small model, it is many multiples more affordable than other recent frontier models, and more than 60% cheaper than [GPT-3.5 Turbo](/models/openai/gpt-3.5-turbo). It maintains SOTA intelligence, while being significantly more cost-effective.
GPT-4o mini achieves an 82% score on MMLU and presently ranks higher than GPT-4 in chat preferences on [common leaderboards](https://arena.lmsys.org/).
Check out the [launch announcement](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) to learn more.
#multimodal
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4o-mini-2024-07-18
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4o-mini-2024-07-18
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.15
per_output_token: 0.6
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
GPT-4o mini is OpenAI's newest model after [GPT-4 Omni](/models/openai/gpt-4o), supporting both text and image inputs with text outputs.
As their most advanced small model, it is many multiples more affordable than other recent frontier models, and more than 60% cheaper than [GPT-3.5 Turbo](/models/openai/gpt-3.5-turbo). It maintains SOTA intelligence, while being significantly more cost-effective.
GPT-4o mini achieves an 82% score on MMLU and presently ranks higher than GPT-4 in chat preferences on [common leaderboards](https://arena.lmsys.org/).
Check out the [launch announcement](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) to learn more.
#multimodal
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4-turbo
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4-turbo
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 10.0
per_output_token: 30.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
The latest GPT-4 Turbo model with vision capabilities. Vision requests can now use JSON mode and function calling.
Training data: up to December 2023.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: gpt-4-turbo-preview
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/gpt-4-turbo-preview
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 10.0
per_output_token: 30.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
The preview GPT-4 model with improved instruction following, JSON mode, reproducible outputs, parallel function calling, and more. Training data: up to Dec 2023.
**Note:** heavily rate-limited by OpenAI while in preview.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: grok-2-1212
model_provider: x-ai
inference_provider:
provider: openrouter
model_name: x-ai/grok-2-1212
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.0
per_output_token: 10.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: Grok 2 1212 introduces significant enhancements to accuracy, instruction adherence, and multilingual support, making it a powerful and flexible choice for developers seeking a highly steerable, intelligent model.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: grok-2-vision-1212
model_provider: x-ai
inference_provider:
provider: openrouter
model_name: x-ai/grok-2-vision-1212
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.0
per_output_token: 10.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Grok 2 Vision 1212 advances image-based AI with stronger visual comprehension, refined instruction-following, and multilingual support. From object recognition to style analysis, it empowers developers to build more intuitive, visually aware applications. Its enhanced steerability and reasoning establish a robust foundation for next-generation image solutions.
To read more about this model, check out [xAI's announcement](https://x.ai/blog/grok-1212).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: grok-beta
model_provider: x-ai
inference_provider:
provider: openrouter
model_name: x-ai/grok-beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 5.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: |-
Grok Beta is xAI's experimental language model with state-of-the-art reasoning capabilities, best for complex and multi-step use cases.
It is the successor to [Grok 2](https://x.ai/blog/grok-2), with enhanced context length.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: grok-vision-beta
model_provider: x-ai
inference_provider:
provider: openrouter
model_name: x-ai/grok-vision-beta
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 5.0
per_output_token: 15.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 8192
description: |+
Grok Vision Beta is xAI's experimental language model with vision capability.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: hermes-2-pro-llama-3-8b
model_provider: nousresearch
inference_provider:
provider: openrouter
model_name: nousresearch/hermes-2-pro-llama-3-8b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.025
per_output_token: 0.04
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131000
description: Hermes 2 Pro is an upgraded, retrained version of Nous Hermes 2, trained on an updated and cleaned version of the OpenHermes 2.5 Dataset as well as a newly introduced Function Calling and JSON Mode dataset developed in-house.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: hermes-3-llama-3.1-405b
model_provider: nousresearch
inference_provider:
provider: openrouter
model_name: nousresearch/hermes-3-llama-3.1-405b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 0.8
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131000
description: |-
Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.
Hermes 3 405B is a frontier-level, full-parameter finetune of the Llama-3.1 405B foundation model, focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.
The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.
Hermes 3 is competitive with, if not superior to, Llama-3.1 Instruct models in general capabilities, with strengths and weaknesses varying between the two.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: hermes-3-llama-3.1-70b
model_provider: nousresearch
inference_provider:
provider: openrouter
model_name: nousresearch/hermes-3-llama-3.1-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.12
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131000
description: |-
Hermes 3 is a generalist language model with many improvements over [Hermes 2](/models/nousresearch/nous-hermes-2-mistral-7b-dpo), including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.
Hermes 3 70B is a competitive, if not superior, finetune of the [Llama-3.1 70B foundation model](/models/meta-llama/llama-3.1-70b-instruct), focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user.
The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: inflection-3-pi
model_provider: inflection
inference_provider:
provider: openrouter
model_name: inflection/inflection-3-pi
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.5
per_output_token: 10.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8000
description: |-
Inflection 3 Pi powers Inflection's [Pi](https://pi.ai) chatbot, including backstory, emotional intelligence, productivity, and safety. It has access to recent news, and excels in scenarios like customer support and roleplay.
Pi has been trained to mirror your tone and style: if you use more emojis, so will Pi! Try experimenting with various prompts and conversation styles.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: inflection-3-productivity
model_provider: inflection
inference_provider:
provider: openrouter
model_name: inflection/inflection-3-productivity
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.5
per_output_token: 10.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8000
description: |-
Inflection 3 Productivity is optimized for following instructions. It is better for tasks requiring JSON output or precise adherence to provided guidelines. It has access to recent news.
For emotional intelligence similar to Pi, see [Inflection 3 Pi](/inflection/inflection-3-pi).
See [Inflection's announcement](https://inflection.ai/blog/enterprise) for more details.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: jamba-1-5-large
model_provider: ai21
inference_provider:
provider: openrouter
model_name: ai21/jamba-1-5-large
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.0
per_output_token: 8.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 256000
description: |-
Jamba 1.5 Large is part of AI21's new family of open models, offering superior speed, efficiency, and quality.
It features a 256K effective context window, the longest among open models, enabling improved performance on tasks like document summarization and analysis.
Built on a novel SSM-Transformer architecture, it outperforms larger models like Llama 3.1 70B on benchmarks while maintaining resource efficiency.
Read their [announcement](https://www.ai21.com/blog/announcing-jamba-model-family) to learn more.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: jamba-1-5-mini
model_provider: ai21
inference_provider:
provider: openrouter
model_name: ai21/jamba-1-5-mini
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 0.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 256000
description: |-
Jamba 1.5 Mini is the world's first production-grade Mamba-based model, combining SSM and Transformer architectures for a 256K context window and high efficiency.
It works with 9 languages and can handle various writing and analysis tasks as well as or better than similar small models.
This model uses less computer memory and works faster with longer texts than previous designs.
Read their [announcement](https://www.ai21.com/blog/announcing-jamba-model-family) to learn more.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: jamba-instruct
model_provider: ai21
inference_provider:
provider: openrouter
model_name: ai21/jamba-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 0.7
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 256000
description: |-
The Jamba-Instruct model, introduced by AI21 Labs, is an instruction-tuned variant of their hybrid SSM-Transformer Jamba model, specifically optimized for enterprise applications.
- 256K Context Window: It can process extensive information, equivalent to a 400-page novel, which is beneficial for tasks involving large documents such as financial reports or legal documents
- Safety and Accuracy: Jamba-Instruct is designed with enhanced safety features to ensure secure deployment in enterprise environments, reducing the risk and cost of implementation
Read their [announcement](https://www.ai21.com/blog/announcing-jamba) to learn more.
Jamba has a knowledge cutoff of February 2024.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: l3.1-70b-hanami-x1
model_provider: sao10k
inference_provider:
provider: openrouter
model_name: sao10k/l3.1-70b-hanami-x1
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 3.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16000
description: This is an experimental model by [Sao10K](/sao10k), built on [Euryale v2.2](/sao10k/l3.1-euryale-70b).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: l3.1-euryale-70b
model_provider: sao10k
inference_provider:
provider: openrouter
model_name: sao10k/l3.1-euryale-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.7
per_output_token: 0.8
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: Euryale L3.1 70B v2.2 is a model focused on creative roleplay from [Sao10k](https://ko-fi.com/sao10k). It is the successor of [Euryale L3 70B v2.1](/models/sao10k/l3-euryale-70b).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: l3.3-euryale-70b
model_provider: sao10k
inference_provider:
provider: openrouter
model_name: sao10k/l3.3-euryale-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.7
per_output_token: 0.8
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: Euryale L3.3 70B is a model focused on creative roleplay from [Sao10k](https://ko-fi.com/sao10k). It is the successor of [Euryale L3.1 70B v2.2](/models/sao10k/l3.1-euryale-70b).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: l3-euryale-70b
model_provider: sao10k
inference_provider:
provider: openrouter
model_name: sao10k/l3-euryale-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.7
per_output_token: 0.8
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Euryale 70B v2.1 is a model focused on creative roleplay from [Sao10k](https://ko-fi.com/sao10k).
- Better prompt adherence.
- Better anatomy / spatial awareness.
- Adapts much better to unique and custom formatting / reply formats.
- Very creative, lots of unique swipes.
- Is not restrictive during roleplays.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: l3-lunaris-8b
model_provider: sao10k
inference_provider:
provider: openrouter
model_name: sao10k/l3-lunaris-8b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.03
per_output_token: 0.06
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Lunaris 8B is a versatile generalist and roleplaying model based on Llama 3. It's a strategic merge of multiple models, designed to balance creativity with improved logic and general knowledge.
Created by [Sao10k](https://huggingface.co/Sao10k), this model aims to offer an improved experience over Stheno v3.2, with enhanced creativity and logical reasoning.
For best results, use with the Llama 3 Instruct context template, temperature 1.4, and min_p 0.1.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: learnlm-1.5-pro-experimental:free
model_provider: google
inference_provider:
provider: openrouter
model_name: google/learnlm-1.5-pro-experimental:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 40960
description: An experimental version of [Gemini 1.5 Pro](/google/gemini-pro-1.5) from Google.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: lfm-3b
model_provider: liquid
inference_provider:
provider: openrouter
model_name: liquid/lfm-3b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.02
per_output_token: 0.02
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Liquid's LFM 3B delivers incredible performance for its size. It ranks first among 3B-parameter transformers, hybrids, and RNN models. It is also on par with Phi-3.5-mini on multiple benchmarks, while being 18.4% smaller.
LFM-3B is the ideal choice for mobile and other edge text-based applications.
See the [launch announcement](https://www.liquid.ai/liquid-foundation-models) for benchmarks and more info.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: lfm-40b
model_provider: liquid
inference_provider:
provider: openrouter
model_name: liquid/lfm-40b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.15
per_output_token: 0.15
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Liquid's 40.3B Mixture of Experts (MoE) model. Liquid Foundation Models (LFMs) are large neural networks built with computational units rooted in dynamic systems.
LFMs are general-purpose AI models that can be used to model any kind of sequential data, including video, audio, text, time series, and signals.
See the [launch announcement](https://www.liquid.ai/liquid-foundation-models) for benchmarks and more info.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: lfm-7b
model_provider: liquid
inference_provider:
provider: openrouter
model_name: liquid/lfm-7b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.01
per_output_token: 0.01
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: "LFM-7B is a new best-in-class language model. It is designed for exceptional chat capabilities, including in languages like Arabic and Japanese. Powered by the Liquid Foundation Model (LFM) architecture, it exhibits unique features like a low memory footprint and fast inference speed.\n\nLFM-7B is the world’s best-in-class multilingual language model in English, Arabic, and Japanese.\n\nSee the [launch announcement](https://www.liquid.ai/lfm-7b) for benchmarks and more info."
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-2-13b-chat
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-2-13b-chat
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.22
per_output_token: 0.22
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: A 13-billion-parameter language model from Meta, fine-tuned for chat completions.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-2-70b-chat
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-2-70b-chat
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.9
per_output_token: 0.9
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: The flagship 70-billion-parameter language model from Meta, fine-tuned for chat completions. Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-405b
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-405b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This is the base 405B pre-trained version.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-405b-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-405b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 0.8
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: |-
The highly anticipated 400B class of Llama3 is here! Clocking in at 128k context with impressive eval scores, the Meta AI team continues to push the frontier of open-source LLMs.
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 405B instruct-tuned version is optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models including GPT-4o and Claude 3.5 Sonnet in evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3-1/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-405b-instruct:free
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-405b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8000
description: |-
The highly anticipated 400B class of Llama3 is here! Clocking in at 128k context with impressive eval scores, the Meta AI team continues to push the frontier of open-source LLMs.
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 405B instruct-tuned version is optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models including GPT-4o and Claude 3.5 Sonnet in evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3-1/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-405b-instruct:nitro
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-405b-instruct:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 14.62
per_output_token: 14.62
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8000
description: |-
The highly anticipated 400B class of Llama3 is here! Clocking in at 128k context with impressive eval scores, the Meta AI team continues to push the frontier of open-source LLMs.
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 405B instruct-tuned version is optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models including GPT-4o and Claude 3.5 Sonnet in evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3-1/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-70b-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-70b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.12
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: |-
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 70B instruct-tuned version is optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3-1/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-70b-instruct:free
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-70b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 70B instruct-tuned version is optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3-1/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-70b-instruct:nitro
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-70b-instruct:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.25
per_output_token: 3.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 64000
description: |-
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 70B instruct-tuned version is optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3-1/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-8b-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-8b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.02
per_output_token: 0.05
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: |-
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 8B instruct-tuned version is fast and efficient.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3-1/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-8b-instruct:free
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-8b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Meta's latest class of model (Llama 3.1) launched with a variety of sizes & flavors. This 8B instruct-tuned version is fast and efficient.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3-1/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-8b-instruct:nitro
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.1-8b-instruct:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.18
per_output_token: 0.18
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: |-
Meta's latest class of models (Llama 3.1) launched in a variety of sizes and flavors. This 8B instruct-tuned version is fast and efficient.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3-1/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters: null
- model: llama-3.1-lumimaid-70b
model_provider: neversleep
inference_provider:
provider: openrouter
model_name: neversleep/llama-3.1-lumimaid-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.375
per_output_token: 4.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |-
Lumimaid v0.2 70B is a finetune of [Llama 3.1 70B](/meta-llama/llama-3.1-70b-instruct) with a "HUGE step up dataset wise" compared to Lumimaid v0.1. Sloppy chat outputs were purged.
Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-lumimaid-8b
model_provider: neversleep
inference_provider:
provider: openrouter
model_name: neversleep/llama-3.1-lumimaid-8b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1875
per_output_token: 1.125
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Lumimaid v0.2 8B is a finetune of [Llama 3.1 8B](/models/meta-llama/llama-3.1-8b-instruct) with a "HUGE step up dataset wise" compared to Lumimaid v0.1. Sloppy chat outputs were purged.
Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-nemotron-70b-instruct
model_provider: nvidia
inference_provider:
provider: openrouter
model_name: nvidia/llama-3.1-nemotron-70b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.12
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131000
description: |-
NVIDIA's Llama 3.1 Nemotron 70B is a language model designed for generating precise and useful responses. Leveraging [Llama 3.1 70B](/models/meta-llama/llama-3.1-70b-instruct) architecture and Reinforcement Learning from Human Feedback (RLHF), it excels in automatic alignment benchmarks. This model is tailored for applications requiring high accuracy in helpfulness and response generation, suitable for diverse user queries across multiple domains.
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-nemotron-70b-instruct:free
model_provider: nvidia
inference_provider:
provider: openrouter
model_name: nvidia/llama-3.1-nemotron-70b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
NVIDIA's Llama 3.1 Nemotron 70B is a language model designed for generating precise and useful responses. Leveraging [Llama 3.1 70B](/models/meta-llama/llama-3.1-70b-instruct) architecture and Reinforcement Learning from Human Feedback (RLHF), it excels in automatic alignment benchmarks. This model is tailored for applications requiring high accuracy in helpfulness and response generation, suitable for diverse user queries across multiple domains.
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-sonar-huge-128k-online
model_provider: perplexity
inference_provider:
provider: openrouter
model_name: perplexity/llama-3.1-sonar-huge-128k-online
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 5.0
per_output_token: 5.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 127072
description: Llama 3.1 Sonar is Perplexity's latest model family. It surpasses their earlier Sonar models in cost-efficiency, speed, and performance. The model is built on Llama 3.1 405B and has internet access.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-sonar-large-128k-chat
model_provider: perplexity
inference_provider:
provider: openrouter
model_name: perplexity/llama-3.1-sonar-large-128k-chat
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 1.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
Llama 3.1 Sonar is Perplexity's latest model family. It surpasses their earlier Sonar models in cost-efficiency, speed, and performance.
This is a normal offline LLM, but the [online version](/models/perplexity/llama-3.1-sonar-large-128k-online) of this model has Internet access.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-sonar-large-128k-online
model_provider: perplexity
inference_provider:
provider: openrouter
model_name: perplexity/llama-3.1-sonar-large-128k-online
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 1.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 127072
description: |-
Llama 3.1 Sonar is Perplexity's latest model family. It surpasses their earlier Sonar models in cost-efficiency, speed, and performance.
This is the online version of the [offline chat model](/models/perplexity/llama-3.1-sonar-large-128k-chat). It is focused on delivering helpful, up-to-date, and factual responses. #online
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-sonar-small-128k-chat
model_provider: perplexity
inference_provider:
provider: openrouter
model_name: perplexity/llama-3.1-sonar-small-128k-chat
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 0.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
Llama 3.1 Sonar is Perplexity's latest model family. It surpasses their earlier Sonar models in cost-efficiency, speed, and performance.
This is a normal offline LLM, but the [online version](/models/perplexity/llama-3.1-sonar-small-128k-online) of this model has Internet access.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.1-sonar-small-128k-online
model_provider: perplexity
inference_provider:
provider: openrouter
model_name: perplexity/llama-3.1-sonar-small-128k-online
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 0.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 127072
description: |-
Llama 3.1 Sonar is Perplexity's latest model family. It surpasses their earlier Sonar models in cost-efficiency, speed, and performance.
This is the online version of the [offline chat model](/models/perplexity/llama-3.1-sonar-small-128k-chat). It is focused on delivering helpful, up-to-date, and factual responses. #online
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.2-11b-vision-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.2-11b-vision-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.055
per_output_token: 0.055
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis.
Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research.
Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md).
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.2-11b-vision-instruct:free
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.2-11b-vision-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis.
Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research.
Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md).
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.2-1b-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.2-1b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.01
per_output_token: 0.01
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
Llama 3.2 1B is a 1-billion-parameter language model focused on efficiently performing natural language tasks, such as summarization, dialogue, and multilingual text analysis. Its smaller size allows it to operate efficiently in low-resource environments while maintaining strong task performance.
Supporting eight core languages and fine-tunable for more, Llama 3.2 1B is ideal for businesses or developers seeking lightweight yet powerful AI solutions that can operate in diverse multilingual settings without the high computational demand of larger models.
Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md).
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.2-1b-instruct:free
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.2-1b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
Llama 3.2 1B is a 1-billion-parameter language model focused on efficiently performing natural language tasks, such as summarization, dialogue, and multilingual text analysis. Its smaller size allows it to operate efficiently in low-resource environments while maintaining strong task performance.
Supporting eight core languages and fine-tunable for more, Llama 3.2 1B is ideal for businesses or developers seeking lightweight yet powerful AI solutions that can operate in diverse multilingual settings without the high computational demand of larger models.
Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md).
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.2-3b-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.2-3b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.015
per_output_token: 0.025
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131000
description: |-
Llama 3.2 3B is a 3-billion-parameter multilingual large language model, optimized for advanced natural language processing tasks like dialogue generation, reasoning, and summarization. Designed with the latest transformer architecture, it supports eight languages, including English, Spanish, and Hindi, and is adaptable for additional languages.
Trained on 9 trillion tokens, the Llama 3.2 3B model excels in instruction-following, complex reasoning, and tool use. Its balanced performance makes it ideal for applications needing accuracy and efficiency in text generation across multilingual settings.
Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md).
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.2-3b-instruct:free
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.2-3b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
Llama 3.2 3B is a 3-billion-parameter multilingual large language model, optimized for advanced natural language processing tasks like dialogue generation, reasoning, and summarization. Designed with the latest transformer architecture, it supports eight languages, including English, Spanish, and Hindi, and is adaptable for additional languages.
Trained on 9 trillion tokens, the Llama 3.2 3B model excels in instruction-following, complex reasoning, and tool use. Its balanced performance makes it ideal for applications needing accuracy and efficiency in text generation across multilingual settings.
Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md).
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.2-90b-vision-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.2-90b-vision-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.9
per_output_token: 0.9
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
The Llama 3.2 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image captioning, visual question answering, and advanced image-text comprehension. Pre-trained on vast multimodal datasets and fine-tuned with human feedback, Llama 3.2 90B Vision is engineered to handle the most demanding image-based AI tasks.
This model is perfect for industries requiring cutting-edge multimodal AI capabilities, particularly those dealing with complex, real-time visual and textual analysis.
Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md).
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.2-90b-vision-instruct:free
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.2-90b-vision-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
The Llama 3.2 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image captioning, visual question answering, and advanced image-text comprehension. Pre-trained on vast multimodal datasets and fine-tuned with human feedback, Llama 3.2 90B Vision is engineered to handle the most demanding image-based AI tasks.
This model is perfect for industries requiring cutting-edge multimodal AI capabilities, particularly those dealing with complex, real-time visual and textual analysis.
Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md).
Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3.3-70b-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3.3-70b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.12
per_output_token: 0.3
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction-tuned generative model with 70B parameters (text in/text out). The Llama 3.3 instruction-tuned, text-only model is optimized for multilingual dialogue use cases and outperforms many of the available open-source and closed chat models on common industry benchmarks.
Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
[Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md)
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3-70b-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3-70b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.23
per_output_token: 0.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 8192
description: |-
Meta's latest class of model (Llama 3) launched with a variety of sizes & flavors. This 70B instruct-tuned version was optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3-70b-instruct:nitro
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3-70b-instruct:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.88
per_output_token: 0.88
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Meta's latest class of model (Llama 3) launched with a variety of sizes & flavors. This 70B instruct-tuned version was optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters: null
- model: llama-3-8b-instruct
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3-8b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.03
per_output_token: 0.06
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 8192
description: |-
Meta's latest class of model (Llama 3) launched with a variety of sizes & flavors. This 8B instruct-tuned version was optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3-8b-instruct:extended
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3-8b-instruct:extended
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1875
per_output_token: 1.125
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |-
Meta's latest class of model (Llama 3) launched with a variety of sizes & flavors. This 8B instruct-tuned version was optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters: null
- model: llama-3-8b-instruct:free
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3-8b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Meta's latest class of model (Llama 3) launched with a variety of sizes & flavors. This 8B instruct-tuned version was optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3-8b-instruct:nitro
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-3-8b-instruct:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 0.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Meta's latest class of model (Llama 3) launched with a variety of sizes & flavors. This 8B instruct-tuned version was optimized for high-quality dialogue use cases.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters: null
- model: llama-3-lumimaid-70b
model_provider: neversleep
inference_provider:
provider: openrouter
model_name: neversleep/llama-3-lumimaid-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.375
per_output_token: 4.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
The NeverSleep team is back, with a Llama 3 70B finetune trained on their curated roleplay data. Striking a balance between eRP and RP, Lumimaid was designed to be serious, yet uncensored when necessary.
To enhance its overall intelligence and chat capability, roughly 40% of the training data was not roleplay. This provides a breadth of knowledge to access, while still keeping roleplay as the primary strength.
Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3-lumimaid-8b
model_provider: neversleep
inference_provider:
provider: openrouter
model_name: neversleep/llama-3-lumimaid-8b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1875
per_output_token: 1.125
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 24576
description: |-
The NeverSleep team is back, with a Llama 3 8B finetune trained on their curated roleplay data. Striking a balance between eRP and RP, Lumimaid was designed to be serious, yet uncensored when necessary.
To enhance its overall intelligence and chat capability, roughly 40% of the training data was not roleplay. This provides a breadth of knowledge to access, while still keeping roleplay as the primary strength.
Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-3-lumimaid-8b:extended
model_provider: neversleep
inference_provider:
provider: openrouter
model_name: neversleep/llama-3-lumimaid-8b:extended
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1875
per_output_token: 1.125
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 24576
description: |-
The NeverSleep team is back, with a Llama 3 8B finetune trained on their curated roleplay data. Striking a balance between eRP and RP, Lumimaid was designed to be serious, yet uncensored when necessary.
To enhance its overall intelligence and chat capability, roughly 40% of the training data was not roleplay. This provides a breadth of knowledge to access, while still keeping roleplay as the primary strength.
Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: llama-guard-2-8b
model_provider: meta-llama
inference_provider:
provider: openrouter
model_name: meta-llama/llama-guard-2-8b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 0.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
This safeguard model has 8B parameters and is based on the Llama 3 family. Just like its predecessor, [LlamaGuard 1](https://huggingface.co/meta-llama/LlamaGuard-7b), it can do both prompt and response classification.
LlamaGuard 2 acts as a normal LLM would, generating text that indicates whether the given input/output is safe/unsafe. If deemed unsafe, it will also share the content categories violated.
For best results, please use raw prompt input or the `/completions` endpoint, instead of the chat API.
It has demonstrated strong performance compared to leading closed-source models in human evaluations.
To read more about the model release, [click here](https://ai.meta.com/blog/meta-llama-3/). Usage of this model is subject to [Meta's Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: magnum-72b
model_provider: alpindale
inference_provider:
provider: openrouter
model_name: alpindale/magnum-72b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.875
per_output_token: 2.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |-
From the maker of [Goliath](https://openrouter.ai/models/alpindale/goliath-120b), Magnum 72B is the first in a new family of models designed to achieve the prose quality of the Claude 3 models, notably Opus & Sonnet.
The model is based on [Qwen2 72B](https://openrouter.ai/models/qwen/qwen-2-72b-instruct) and trained with 55 million tokens of highly curated roleplay (RP) data.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: magnum-v2-72b
model_provider: anthracite-org
inference_provider:
provider: openrouter
model_name: anthracite-org/magnum-v2-72b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 3.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
From the maker of [Goliath](https://openrouter.ai/models/alpindale/goliath-120b), Magnum 72B is the seventh in a family of models designed to achieve the prose quality of the Claude 3 models, notably Opus & Sonnet.
The model is based on [Qwen2 72B](https://openrouter.ai/models/qwen/qwen-2-72b-instruct) and trained with 55 million tokens of highly curated roleplay (RP) data.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: magnum-v4-72b
model_provider: anthracite-org
inference_provider:
provider: openrouter
model_name: anthracite-org/magnum-v4-72b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.875
per_output_token: 2.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |-
This is a series of models designed to replicate the prose quality of the Claude 3 models, specifically [Sonnet](https://openrouter.ai/anthropic/claude-3.5-sonnet) and [Opus](https://openrouter.ai/anthropic/claude-3-opus).
The model is fine-tuned on top of [Qwen2.5 72B](https://openrouter.ai/qwen/qwen-2.5-72b-instruct).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: midnight-rose-70b
model_provider: sophosympatheia
inference_provider:
provider: openrouter
model_name: sophosympatheia/midnight-rose-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 0.8
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
A merge with a complex family tree, this model was crafted for roleplaying and storytelling. Midnight Rose is a successor to Rogue Rose and Aurora Nights and improves upon them both. It wants to produce lengthy output by default and is the best creative writing merge produced so far by sophosympatheia.
Descending from earlier versions of Midnight Rose and [Wizard Tulu Dolphin 70B](https://huggingface.co/sophosympatheia/Wizard-Tulu-Dolphin-70B-v1.0), it inherits the best qualities of each.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: minimax-01
model_provider: minimax
inference_provider:
provider: openrouter
model_name: minimax/minimax-01
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 1.1
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 1000192
description: |-
MiniMax-01 combines MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding. It has 456 billion parameters, with 45.9 billion parameters activated per inference, and can handle a context of up to 4 million tokens.
The text model adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). The image model adopts the “ViT-MLP-LLM” framework and is trained on top of the text model.
To read more about the release, see: https://www.minimaxi.com/en/news/minimax-01-series-2
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: ministral-3b
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/ministral-3b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.04
per_output_token: 0.04
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: Ministral 3B is a 3B parameter model optimized for on-device and edge computing. It excels in knowledge, commonsense reasoning, and function-calling, outperforming larger models like Mistral 7B on most benchmarks. Supporting up to 128k context length, it’s ideal for orchestrating agentic workflows and specialist tasks with efficient inference.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: ministral-8b
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/ministral-8b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1
per_output_token: 0.1
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: Ministral 8B is an 8B parameter model featuring a unique interleaved sliding-window attention pattern for faster, memory-efficient inference. Designed for edge use cases, it supports up to 128k context length and excels in knowledge and reasoning tasks. It outperforms peers in the sub-10B category, making it perfect for low-latency, privacy-first applications.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-7b-instruct
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-7b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.03
per_output_token: 0.055
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: |-
A high-performing, industry-standard 7.3B parameter model, with optimizations for speed and context length.
*Mistral 7B Instruct has multiple version variants, and this is intended to be the latest version.*
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-7b-instruct:free
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-7b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
A high-performing, industry-standard 7.3B parameter model, with optimizations for speed and context length.
*Mistral 7B Instruct has multiple version variants, and this is intended to be the latest version.*
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-7b-instruct:nitro
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-7b-instruct:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.07
per_output_token: 0.07
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
A high-performing, industry-standard 7.3B parameter model, with optimizations for speed and context length.
*Mistral 7B Instruct has multiple version variants, and this is intended to be the latest version.*
parameters:
frequency_penalty:
default: 0
description: frequency_penalty penalizes the repetition of words based on their frequency in the generated text. A higher frequency penalty discourages the model from repeating words that have already appeared frequently in the output, promoting diversity and reducing repetition.
max: 2.0
min: -2.0
required: false
type: number
max_tokens:
description: The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
required: false
type: int
n:
default: 1
description: Number of completions to return for each request. Input tokens are only billed once.
required: false
type: int
prediction:
description: 'Example: {{type: content,content: json_object}}. Enables users to specify expected results, optimizing response times by leveraging known or predictable content. This approach is especially effective for updating text documents or code files with minimal changes, reducing latency while maintaining high-quality results.'
required: false
type: object
presence_penalty:
default: 0
description: presence_penalty determines how much the model penalizes the repetition of words or phrases. A higher presence penalty encourages the model to use a wider variety of words and phrases, making the output more diverse and creative.
max: 2.0
min: -2.0
required: false
type: number
random_seed:
default: null
description: The seed to use for random sampling. If set, different calls made with the same seed will generate deterministic results.
max: null
min: null
required: false
type: int
response_format:
default: null
description: 'An object specifying the format that the model must output. Setting to type: json_object enables JSON mode, which guarantees the message the model generates is in JSON. When using JSON mode you MUST also instruct the model to produce JSON yourself with a system or a user message.'
required: false
type: object
safe_prompt:
default: false
description: Whether to inject a safety prompt before all conversations.
required: false
type: boolean
stop:
default: null
description: Stop generation if this token is detected or, when an array is provided, if any one of its tokens is detected.
max: null
min: null
required: false
type: string/array
temperature:
description: What sampling temperature to use; we recommend a value between 0.0 and 0.7. Higher values like 0.7 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both. The default value varies depending on the model you are targeting. Call the /models endpoint to retrieve the appropriate value.
max: 0.7
min: 0.1
required: false
type: number
top_p:
default: 1
description: Nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
type: number
- model: mistral-7b-instruct-v0.1
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-7b-instruct-v0.1
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 0.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: A 7.3B parameter model that outperforms Llama 2 13B on all benchmarks, with optimizations for speed and context length.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-7b-instruct-v0.3
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-7b-instruct-v0.3
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.03
per_output_token: 0.055
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: |-
A high-performing, industry-standard 7.3B parameter model, with optimizations for speed and context length.
An improved version of [Mistral 7B Instruct v0.2](/models/mistralai/mistral-7b-instruct-v0.2), with the following changes:
- Extended vocabulary to 32768
- Supports v3 Tokenizer
- Supports function calling
NOTE: Support for function calling depends on the provider.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-large
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-large
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.0
per_output_token: 6.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
This is Mistral AI's flagship model, Mistral Large 2 (version `mistral-large-2407`). It's a proprietary weights-available model and excels at reasoning, code, JSON, chat, and more. Read the launch announcement [here](https://mistral.ai/news/mistral-large-2407/).
It supports dozens of languages including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean, along with 80+ coding languages including Python, Java, C, C++, JavaScript, and Bash. Its long context window allows precise information recall from large documents.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-large-2407
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-large-2407
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.0
per_output_token: 6.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |
This is Mistral AI's flagship model, Mistral Large 2 (version mistral-large-2407). It's a proprietary weights-available model and excels at reasoning, code, JSON, chat, and more. Read the launch announcement [here](https://mistral.ai/news/mistral-large-2407/).
It supports dozens of languages including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean, along with 80+ coding languages including Python, Java, C, C++, JavaScript, and Bash. Its long context window allows precise information recall from large documents.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-large-2411
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-large-2411
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.0
per_output_token: 6.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
Mistral Large 2 2411 is an update of [Mistral Large 2](/mistralai/mistral-large) released together with [Pixtral Large 2411](/mistralai/pixtral-large-2411).
It provides a significant upgrade on the previous [Mistral Large 24.07](/mistralai/mistral-large-2407), with notable improvements in long context understanding, a new system prompt, and more accurate function calling.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-medium
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-medium
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.75
per_output_token: 8.1
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32000
description: This is Mistral AI's closed-source, medium-sized model. It's powered by a closed-source prototype and excels at reasoning, code, JSON, chat, and more. In benchmarks, it compares with many of the flagship models of other companies.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-nemo
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-nemo
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.035
per_output_token: 0.08
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: |-
A 12B parameter model with a 128k token context length built by Mistral in collaboration with NVIDIA.
The model is multilingual, supporting English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.
It supports function calling and is released under the Apache 2.0 license.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-small
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-small
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 0.6
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32000
description: With 22 billion parameters, Mistral Small v24.09 offers a convenient mid-point between [Mistral NeMo 12B](/mistralai/mistral-nemo) and [Mistral Large 2](/mistralai/mistral-large), providing a cost-effective solution that can be deployed across various platforms and environments. It has better reasoning, exhibits more capabilities, can produce and reason about code, and is multilingual, supporting English, French, German, Italian, and Spanish.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-small-24b-instruct-2501
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-small-24b-instruct-2501
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.07
per_output_token: 0.14
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Mistral Small 3 is a 24B-parameter language model optimized for low-latency performance across common AI tasks. Released under the Apache 2.0 license, it features both pre-trained and instruction-tuned versions designed for efficient local deployment.
The model achieves 81% accuracy on the MMLU benchmark and performs competitively with larger models like Llama 3.3 70B and Qwen 32B, while operating at three times the speed on equivalent hardware. [Read the blog post about the model here.](https://mistral.ai/news/mistral-small-3/)
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mistral-tiny
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mistral-tiny
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.25
per_output_token: 0.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32000
description: This model is currently powered by Mistral-7B-v0.2, and incorporates a "better" fine-tuning than [Mistral 7B](/models/mistralai/mistral-7b-instruct-v0.1), inspired by community work. It's best used for large batch processing tasks where cost is a significant factor but reasoning capabilities are not crucial.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mixtral-8x22b-instruct
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mixtral-8x22b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.9
per_output_token: 0.9
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 65536
description: |-
Mistral's official instruct fine-tuned version of [Mixtral 8x22B](/models/mistralai/mixtral-8x22b). It uses 39B active parameters out of 141B, offering unparalleled cost efficiency for its size. Its strengths include:
- strong math, coding, and reasoning
- large context length (64k)
- fluency in English, French, Italian, German, and Spanish
See benchmarks on the launch announcement [here](https://mistral.ai/news/mixtral-8x22b/).
#moe
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mixtral-8x7b
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mixtral-8x7b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.6
per_output_token: 0.6
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Mixtral 8x7B is a pretrained generative Sparse Mixture of Experts, by Mistral AI. Incorporates 8 experts (feed-forward networks) for a total of 47B parameters. Base model (not fine-tuned for instructions) - see [Mixtral 8x7B Instruct](/models/mistralai/mixtral-8x7b-instruct) for an instruct-tuned model.
#moe
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mixtral-8x7b-instruct
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mixtral-8x7b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.24
per_output_token: 0.24
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: |-
Mixtral 8x7B Instruct is a pretrained generative Sparse Mixture of Experts, by Mistral AI, for chat and instruction use. Incorporates 8 experts (feed-forward networks) for a total of 47 billion parameters.
Instruct model fine-tuned by Mistral. #moe
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mixtral-8x7b-instruct:nitro
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/mixtral-8x7b-instruct:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 0.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Mixtral 8x7B Instruct is a pretrained generative Sparse Mixture of Experts, by Mistral AI, for chat and instruction use. Incorporates 8 experts (feed-forward networks) for a total of 47 billion parameters.
Instruct model fine-tuned by Mistral. #moe
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mn-celeste-12b
model_provider: nothingiisreal
inference_provider:
provider: openrouter
model_name: nothingiisreal/mn-celeste-12b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |-
A specialized story writing and roleplaying model based on Mistral's NeMo 12B Instruct. Fine-tuned on curated datasets including Reddit Writing Prompts and Opus Instruct 25K.
This model excels at creative writing, offering improved NSFW capabilities, with smarter and more active narration. It demonstrates remarkable versatility in both SFW and NSFW scenarios, with strong Out of Character (OOC) steering capabilities, allowing fine-tuned control over narrative direction and character behavior.
Check out the model's [HuggingFace page](https://huggingface.co/nothingiisreal/MN-12B-Celeste-V1.9) for details on what parameters and prompts work best!
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mn-inferor-12b
model_provider: infermatic
inference_provider:
provider: openrouter
model_name: infermatic/mn-inferor-12b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |
Inferor 12B is a merge of top roleplay models, specialized in immersive narratives and storytelling.
This model was merged using the [Model Stock](https://arxiv.org/abs/2403.19522) merge method using [anthracite-org/magnum-v4-12b](https://openrouter.ai/anthracite-org/magnum-v4-72b) as a base.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mn-starcannon-12b
model_provider: aetherwiing
inference_provider:
provider: openrouter
model_name: aetherwiing/mn-starcannon-12b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: |-
Starcannon 12B v2 is a creative roleplay and story writing model, based on Mistral Nemo, using [nothingiisreal/mn-celeste-12b](/nothingiisreal/mn-celeste-12b) as a base, with [intervitens/mini-magnum-12b-v1.1](https://huggingface.co/intervitens/mini-magnum-12b-v1.1) merged in using the [TIES](https://arxiv.org/abs/2306.01708) method.
Although more similar to Magnum overall, the model remains very creative, with a pleasant writing style. It is recommended for people wanting more variety than Magnum, and yet more verbose prose than Celeste.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mythalion-13b
model_provider: pygmalionai
inference_provider:
provider: openrouter
model_name: pygmalionai/mythalion-13b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: 'A blend of the new Pygmalion-13b and MythoMax. #merge'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mythomax-l2-13b
model_provider: gryphe
inference_provider:
provider: openrouter
model_name: gryphe/mythomax-l2-13b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.065
per_output_token: 0.065
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: 'One of the highest performing and most popular fine-tunes of Llama 2 13B, with rich descriptions and roleplay. #merge'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mythomax-l2-13b:extended
model_provider: gryphe
inference_provider:
provider: openrouter
model_name: gryphe/mythomax-l2-13b:extended
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.125
per_output_token: 1.125
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: 'One of the highest performing and most popular fine-tunes of Llama 2 13B, with rich descriptions and roleplay. #merge'
parameters: null
- model: mythomax-l2-13b:free
model_provider: gryphe
inference_provider:
provider: openrouter
model_name: gryphe/mythomax-l2-13b:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: 'One of the highest performing and most popular fine-tunes of Llama 2 13B, with rich descriptions and roleplay. #merge'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: mythomax-l2-13b:nitro
model_provider: gryphe
inference_provider:
provider: openrouter
model_name: gryphe/mythomax-l2-13b:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.2
per_output_token: 0.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: 'One of the highest performing and most popular fine-tunes of Llama 2 13B, with rich descriptions and roleplay. #merge'
parameters: null
- model: noromaid-20b
model_provider: neversleep
inference_provider:
provider: openrouter
model_name: neversleep/noromaid-20b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.5
per_output_token: 2.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
A collab between IkariDev and Undi. This merge is suitable for RP, ERP, and general knowledge.
#merge #uncensored
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: nous-hermes-2-mixtral-8x7b-dpo
model_provider: nousresearch
inference_provider:
provider: openrouter
model_name: nousresearch/nous-hermes-2-mixtral-8x7b-dpo
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.6
per_output_token: 0.6
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Nous Hermes 2 Mixtral 8x7B DPO is the new flagship Nous Research model trained over the [Mixtral 8x7B MoE LLM](/models/mistralai/mixtral-8x7b).
The model was trained on over 1,000,000 entries of primarily [GPT-4](/models/openai/gpt-4) generated data, as well as other high-quality data from open datasets across the AI landscape, achieving state-of-the-art performance on a variety of tasks.
#moe
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: nous-hermes-llama2-13b
model_provider: nousresearch
inference_provider:
provider: openrouter
model_name: nousresearch/nous-hermes-llama2-13b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.17
per_output_token: 0.17
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: A state-of-the-art language model fine-tuned on over 300k instructions by Nous Research, with Teknium and Emozilla leading the fine-tuning process.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: nova-lite-v1
model_provider: amazon
inference_provider:
provider: openrouter
model_name: amazon/nova-lite-v1
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.06
per_output_token: 0.24
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 300000
description: |-
Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon, focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time customer interactions, document analysis, and visual question-answering tasks with high accuracy.
With an input context of 300K tokens, it can analyze multiple images or up to 30 minutes of video in a single input.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: nova-micro-v1
model_provider: amazon
inference_provider:
provider: openrouter
model_name: amazon/nova-micro-v1
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.035
per_output_token: 0.14
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: Amazon Nova Micro 1.0 is a text-only model that delivers the lowest latency responses in the Amazon Nova family of models at a very low cost. With a context length of 128K tokens and optimized for speed and cost, Amazon Nova Micro excels at tasks such as text summarization, translation, content classification, interactive chat, and brainstorming. It has simple mathematical reasoning and coding abilities.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: nova-pro-v1
model_provider: amazon
inference_provider:
provider: openrouter
model_name: amazon/nova-pro-v1
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 3.2
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 300000
description: |-
Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the-art performance on key benchmarks including visual question answering (TextVQA) and video understanding (VATEX).
Amazon Nova Pro demonstrates strong capabilities in processing both visual and textual information and in analyzing financial documents.
**NOTE**: Video input is not supported at this time.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: o1
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/o1
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 15.0
per_output_token: 60.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: "The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. \n\nThe o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1).\n"
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
- model: o1-mini
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/o1-mini
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.1
per_output_token: 4.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: |-
The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding.
The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1).
Note: This model is currently experimental, is not suitable for production use cases, and may be heavily rate-limited.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
- model: o1-mini-2024-09-12
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/o1-mini-2024-09-12
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.1
per_output_token: 4.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: |-
The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding.
The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1).
Note: This model is currently experimental, is not suitable for production use cases, and may be heavily rate-limited.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
- model: o1-preview
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/o1-preview
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 15.0
per_output_token: 60.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: |-
The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding.
The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1).
Note: This model is currently experimental, is not suitable for production use cases, and may be heavily rate-limited.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
- model: o1-preview-2024-09-12
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/o1-preview-2024-09-12
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 15.0
per_output_token: 60.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: |-
The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding.
The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1).
Note: This model is currently experimental, is not suitable for production use cases, and may be heavily rate-limited.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
- model: o3-mini
model_provider: openai
inference_provider:
provider: openrouter
model_name: openai/o3-mini
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.1
per_output_token: 4.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 200000
description: |-
OpenAI o3-mini is a cost-efficient language model optimized for STEM reasoning tasks, particularly excelling in science, mathematics, and coding. The model features three adjustable reasoning effort levels and supports key developer capabilities including function calling, structured outputs, and streaming, though it does not include vision processing capabilities.
The model demonstrates significant improvements over its predecessor, with expert testers preferring its responses 56% of the time and noting a 39% reduction in major errors on complex questions. With medium reasoning effort settings, o3-mini matches the performance of the larger o1 model on challenging reasoning evaluations like AIME and GPQA, while maintaining lower latency and cost.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
- model: openchat-7b
model_provider: openchat
inference_provider:
provider: openrouter
model_name: openchat/openchat-7b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.055
per_output_token: 0.055
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
OpenChat 7B is a library of open-source language models, fine-tuned with "C-RLFT (Conditioned Reinforcement Learning Fine-Tuning)" - a strategy inspired by offline reinforcement learning. It has been trained on mixed-quality data without preference labels.
- For OpenChat fine-tuned on Mistral 7B, check out [OpenChat 7B](/models/openchat/openchat-7b).
- For OpenChat fine-tuned on Llama 8B, check out [OpenChat 8B](/models/openchat/openchat-8b).
#open-source
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: openchat-7b:free
model_provider: openchat
inference_provider:
provider: openrouter
model_name: openchat/openchat-7b:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
OpenChat 7B is a library of open-source language models, fine-tuned with "C-RLFT (Conditioned Reinforcement Learning Fine-Tuning)" - a strategy inspired by offline reinforcement learning. It has been trained on mixed-quality data without preference labels.
- For OpenChat fine-tuned on Mistral 7B, check out [OpenChat 7B](/models/openchat/openchat-7b).
- For OpenChat fine-tuned on Llama 8B, check out [OpenChat 8B](/models/openchat/openchat-8b).
#open-source
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: openhermes-2.5-mistral-7b
model_provider: teknium
inference_provider:
provider: openrouter
model_name: teknium/openhermes-2.5-mistral-7b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.17
per_output_token: 0.17
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
A continuation of [OpenHermes 2 model](/models/teknium/openhermes-2-mistral-7b), trained on additional code datasets.
Potentially the most interesting finding from training on a good ratio of code instruction data (estimated at around 7-14% of the total dataset) was that it boosted several non-code benchmarks, including TruthfulQA, AGIEval, and the GPT4All suite. It did, however, reduce the BigBench benchmark score, but the net gain overall is significant.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: palm-2-chat-bison
model_provider: google
inference_provider:
provider: openrouter
model_name: google/palm-2-chat-bison
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 9216
description: PaLM 2 is a language model by Google with improved multilingual, reasoning and coding capabilities.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: palm-2-chat-bison-32k
model_provider: google
inference_provider:
provider: openrouter
model_name: google/palm-2-chat-bison-32k
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: PaLM 2 is a language model by Google with improved multilingual, reasoning and coding capabilities.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: palm-2-codechat-bison
model_provider: google
inference_provider:
provider: openrouter
model_name: google/palm-2-codechat-bison
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 7168
description: PaLM 2 fine-tuned for chatbot conversations that help with code-related questions.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: palm-2-codechat-bison-32k
model_provider: google
inference_provider:
provider: openrouter
model_name: google/palm-2-codechat-bison-32k
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 2.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: PaLM 2 fine-tuned for chatbot conversations that help with code-related questions.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: phi-3.5-mini-128k-instruct
model_provider: microsoft
inference_provider:
provider: openrouter
model_name: microsoft/phi-3.5-mini-128k-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1
per_output_token: 0.1
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
Phi-3.5 models are lightweight, state-of-the-art open models. These models were trained with Phi-3 datasets that include both synthetic data and filtered, publicly available website data, with a focus on high-quality, reasoning-dense properties. Phi-3.5 Mini has 3.8B parameters and is a dense decoder-only transformer model using the same tokenizer as [Phi-3 Mini](/models/microsoft/phi-3-mini-128k-instruct).
The models underwent a rigorous enhancement process, incorporating supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures. When assessed against benchmarks that test common sense, language understanding, math, code, long context, and logical reasoning, Phi-3.5 models showcased robust and state-of-the-art performance among models with fewer than 13 billion parameters.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: phi-3-medium-128k-instruct
model_provider: microsoft
inference_provider:
provider: openrouter
model_name: microsoft/phi-3-medium-128k-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 1.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
Phi-3 128K Medium is a powerful 14-billion parameter model designed for advanced language understanding, reasoning, and instruction following. Optimized through supervised fine-tuning and preference adjustments, it excels in tasks involving common sense, mathematics, logical reasoning, and code processing.
At time of release, Phi-3 Medium demonstrated state-of-the-art performance among lightweight models. In the MMLU-Pro eval, the model even comes close to a Llama3 70B level of performance.
For 4k context length, try [Phi-3 Medium 4K](/models/microsoft/phi-3-medium-4k-instruct).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: phi-3-medium-128k-instruct:free
model_provider: microsoft
inference_provider:
provider: openrouter
model_name: microsoft/phi-3-medium-128k-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 8192
description: |-
Phi-3 128K Medium is a powerful 14-billion parameter model designed for advanced language understanding, reasoning, and instruction following. Optimized through supervised fine-tuning and preference adjustments, it excels in tasks involving common sense, mathematics, logical reasoning, and code processing.
At time of release, Phi-3 Medium demonstrated state-of-the-art performance among lightweight models. In the MMLU-Pro eval, the model even comes close to a Llama3 70B level of performance.
For 4k context length, try [Phi-3 Medium 4K](/models/microsoft/phi-3-medium-4k-instruct).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: phi-3-mini-128k-instruct
model_provider: microsoft
inference_provider:
provider: openrouter
model_name: microsoft/phi-3-mini-128k-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1
per_output_token: 0.1
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |-
Phi-3 Mini is a powerful 3.8B parameter model designed for advanced language understanding, reasoning, and instruction following. Optimized through supervised fine-tuning and preference adjustments, it excels in tasks involving common sense, mathematics, logical reasoning, and code processing.
At time of release, Phi-3 Mini demonstrated state-of-the-art performance among lightweight models. This model is static, trained on an offline dataset with an October 2023 cutoff date.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: phi-3-mini-128k-instruct:free
model_provider: microsoft
inference_provider:
provider: openrouter
model_name: microsoft/phi-3-mini-128k-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 8192
description: |-
Phi-3 Mini is a powerful 3.8B parameter model designed for advanced language understanding, reasoning, and instruction following. Optimized through supervised fine-tuning and preference adjustments, it excels in tasks involving common sense, mathematics, logical reasoning, and code processing.
At time of release, Phi-3 Mini demonstrated state-of-the-art performance among lightweight models. This model is static, trained on an offline dataset with an October 2023 cutoff date.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: phi-4
model_provider: microsoft
inference_provider:
provider: openrouter
model_name: microsoft/phi-4
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.07
per_output_token: 0.14
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16384
description: "[Microsoft Research](/microsoft) Phi-4 is designed to perform well in complex reasoning tasks and can operate efficiently in situations with limited memory or where quick responses are needed.\n\nAt 14 billion parameters, it was trained on a mix of high-quality synthetic datasets, data from curated websites, and academic materials. It has undergone careful improvement to follow instructions accurately and maintain strong safety standards. It works best with English-language inputs.\n\nFor more information, please see the [Phi-4 Technical Report](https://arxiv.org/pdf/2412.08905).\n"
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: pixtral-12b
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/pixtral-12b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1
per_output_token: 0.1
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: 'The first multimodal, text+image-to-text model from Mistral AI. Its weights were released via torrent: https://x.com/mistralai/status/1833758285167722836.'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: pixtral-large-2411
model_provider: mistralai
inference_provider:
provider: openrouter
model_name: mistralai/pixtral-large-2411
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 2.0
per_output_token: 6.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 128000
description: |+
Pixtral Large is a 124B parameter, open-weight, multimodal model built on top of [Mistral Large 2](/mistralai/mistral-large-2411). The model is able to understand documents, charts and natural images.
The model is available under the Mistral Research License (MRL) for research and educational use, and the Mistral Commercial License for experimentation, testing, and production for commercial purposes.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qvq-72b-preview
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qvq-72b-preview
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.25
per_output_token: 0.5
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32000
description: |-
QVQ-72B-Preview is an experimental research model developed by the [Qwen](/qwen) team, focusing on enhancing visual reasoning capabilities.
## Performance
| | **QVQ-72B-Preview** | o1-2024-12-17 | gpt-4o-2024-05-13 | Claude3.5 Sonnet-20241022 | Qwen2VL-72B |
|----------------|-----------------|---------------|-------------------|----------------------------|-------------|
| MMMU(val) | 70.3 | 77.3 | 69.1 | 70.4 | 64.5 |
| MathVista(mini) | 71.4 | 71.0 | 63.8 | 65.3 | 70.5 |
| MathVision(full) | 35.9 | – | 30.4 | 35.6 | 25.9 |
| OlympiadBench | 20.4 | – | 25.9 | – | 11.2 |
## Limitations
1. **Language Mixing and Code-Switching:** The model might occasionally mix different languages or unexpectedly switch between them, potentially affecting the clarity of its responses.
2. **Recursive Reasoning Loops:** There's a risk of the model getting caught in recursive reasoning loops, leading to lengthy responses that may not even arrive at a final answer.
3. **Safety and Ethical Considerations:** Robust safety measures are needed to ensure reliable and safe performance. Users should exercise caution when deploying this model.
4. **Performance and Benchmark Limitations:** Despite the improvements in visual reasoning, QVQ doesn’t entirely replace the capabilities of [Qwen2-VL-72B](/qwen/qwen-2-vl-72b-instruct). During multi-step visual reasoning, the model might gradually lose focus on the image content, leading to hallucinations. Moreover, QVQ doesn’t show significant improvement over [Qwen2-VL-72B](/qwen/qwen-2-vl-72b-instruct) in basic recognition tasks like identifying people, animals, or plants.
Note: Currently, the model only supports single-round dialogues and image outputs. It does not support video inputs.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-2.5-72b-instruct
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-2.5-72b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.13
per_output_token: 0.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 128000
description: |-
Qwen2.5 72B is part of the latest series of Qwen large language models. Qwen2.5 brings the following improvements over Qwen2:
- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support for up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-2.5-7b-instruct
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-2.5-7b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.025
per_output_token: 0.05
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Qwen2.5 7B is part of the latest series of Qwen large language models. Qwen2.5 brings the following improvements over Qwen2:
- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support for up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-2.5-coder-32b-instruct
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-2.5-coder-32b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.07
per_output_token: 0.16
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 33000
description: "Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen). Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:\n\n- Significant improvements in **code generation**, **code reasoning** and **code fixing**.\n- A more comprehensive foundation for real-world applications such as **Code Agents**, enhancing coding capabilities while maintaining its strengths in mathematics and general competencies.\n\nTo read more about its evaluation results, check out [Qwen 2.5 Coder's blog](https://qwenlm.github.io/blog/qwen2.5-coder-family/)."
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen2.5-vl-72b-instruct:free
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen2.5-vl-72b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 131072
description: Qwen2.5-VL is proficient in recognizing common objects such as flowers, birds, fish, and insects. It is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-2-72b-instruct
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-2-72b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.9
per_output_token: 0.9
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Qwen2 72B is a transformer-based model that excels in language understanding, multilingual capabilities, coding, mathematics, and reasoning.
It features SwiGLU activation, attention QKV bias, and group query attention. It is pretrained on extensive data and then post-trained with supervised finetuning and direct preference optimization.
For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2/) and [GitHub repo](https://github.com/QwenLM/Qwen2).
Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-2-7b-instruct
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-2-7b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.054
per_output_token: 0.054
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Qwen2 7B is a transformer-based model that excels in language understanding, multilingual capabilities, coding, mathematics, and reasoning.
It features SwiGLU activation, attention QKV bias, and group query attention. It is pretrained on extensive data and then post-trained with supervised finetuning and direct preference optimization.
For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2/) and [GitHub repo](https://github.com/QwenLM/Qwen2).
Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-2-7b-instruct:free
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-2-7b-instruct:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: |-
Qwen2 7B is a transformer-based model that excels in language understanding, multilingual capabilities, coding, mathematics, and reasoning.
It features SwiGLU activation, attention QKV bias, and group query attention. It is pretrained on extensive data and then post-trained with supervised finetuning and direct preference optimization.
For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2/) and [GitHub repo](https://github.com/QwenLM/Qwen2).
Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-2-vl-72b-instruct
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-2-vl-72b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.4
per_output_token: 0.4
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
Qwen2 VL 72B is a multimodal LLM from the Qwen Team with the following key enhancements:
- State-of-the-art understanding of images at various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub repo](https://github.com/QwenLM/Qwen2-VL).
Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-2-vl-7b-instruct
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-2-vl-7b-instruct
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.1
per_output_token: 0.1
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
Qwen2 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements:
- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub repo](https://github.com/QwenLM/Qwen2-VL).
Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-max
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-max
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.6
per_output_token: 6.4
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 32768
description: Qwen-Max, based on Qwen2.5, provides the best inference performance among [Qwen models](/qwen), especially for complex multi-step tasks. It's a large-scale MoE model that has been pretrained on over 20 trillion tokens and further post-trained with curated Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) methodologies. The parameter count is unknown.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-plus
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-plus
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.4
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 131072
description: Qwen-Plus, based on the Qwen2.5 foundation model, is a 131K context model with a balanced combination of performance, speed, and cost.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-turbo
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-turbo
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.05
per_output_token: 0.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities:
- tools
type: completions
limits:
max_context_size: 1000000
description: Qwen-Turbo, based on Qwen2.5, is a 1M context model that offers high speed and low cost, suitable for simple tasks.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
tool_choice:
default: none
description: Controls which (if any) tool is called by the model. Accepted values include 'none' (no tool call), 'auto' (model decides), 'required' (must call a tool), or a specific tool identifier/object.
required: false
type: string
tools:
default: []
description: A list of tools available for or used during the generation process. This follows a specific tool-calling schema.
required: false
type: array
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwen-vl-plus:free
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwen-vl-plus:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
- image
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 7500
description: |
Qwen's Enhanced Large Visual Language Model. Significantly upgraded for detailed recognition capabilities and text recognition abilities, supporting ultra-high pixel resolutions up to millions of pixels and extreme aspect ratios for image input. It delivers significant performance across a broad range of visual tasks.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: qwq-32b-preview
model_provider: qwen
inference_provider:
provider: openrouter
model_name: qwen/qwq-32b-preview
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.12
per_output_token: 0.18
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |+
QwQ-32B-Preview is an experimental research model focused on AI reasoning capabilities developed by the Qwen Team. As a preview release, it demonstrates promising analytical abilities while having several important limitations:
1. **Language Mixing and Code-Switching**: The model may mix languages or switch between them unexpectedly, affecting response clarity.
2. **Recursive Reasoning Loops**: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer.
3. **Safety and Ethical Considerations**: The model requires enhanced safety measures to ensure reliable and secure performance, and users should exercise caution when deploying it.
4. **Performance and Benchmark Limitations**: The model excels in math and coding but has room for improvement in other areas, such as common sense reasoning and nuanced language understanding.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: remm-slerp-l2-13b
model_provider: undi95
inference_provider:
provider: openrouter
model_name: undi95/remm-slerp-l2-13b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.8
per_output_token: 1.2
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: 'A recreation trial of the original MythoMax-L2-B13 but with updated models. #merge'
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: remm-slerp-l2-13b:extended
model_provider: undi95
inference_provider:
provider: openrouter
model_name: undi95/remm-slerp-l2-13b:extended
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.125
per_output_token: 1.125
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 6144
description: 'A recreation trial of the original MythoMax-L2-B13 but with updated models. #merge'
parameters: null
- model: rocinante-12b
model_provider: thedrummer
inference_provider:
provider: openrouter
model_name: thedrummer/rocinante-12b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.25
per_output_token: 0.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
Rocinante 12B is designed for engaging storytelling and rich prose.
Early testers have reported:
- Expanded vocabulary with unique and expressive word choices
- Enhanced creativity for vivid narratives
- Adventure-filled and captivating stories
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: rogue-rose-103b-v0.2:free
model_provider: sophosympatheia
inference_provider:
provider: openrouter
model_name: sophosympatheia/rogue-rose-103b-v0.2:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |
Rogue Rose demonstrates strong capabilities in roleplaying and storytelling applications, potentially surpassing other models in the 103-120B parameter range. While it occasionally exhibits inconsistencies with scene logic, the overall interaction quality represents an advancement in natural language processing for creative applications.
It is a 120-layer frankenmerge model combining two custom 70B architectures from November 2023, derived from the [xwin-stellarbright-erp-70b-v2](https://huggingface.co/sophosympatheia/xwin-stellarbright-erp-70b-v2) base.
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: sonar
model_provider: perplexity
inference_provider:
provider: openrouter
model_name: perplexity/sonar
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 1.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 127072
description: Sonar is lightweight, affordable, fast, and simple to use — now featuring citations and the ability to customize sources. It is designed for companies seeking to integrate lightweight question-and-answer features optimized for speed.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: sonar-reasoning
model_provider: perplexity
inference_provider:
provider: openrouter
model_name: perplexity/sonar-reasoning
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.0
per_output_token: 5.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 127000
description: "Sonar Reasoning is a reasoning model provided by Perplexity based on [DeepSeek R1](/deepseek/deepseek-r1).\n\nIt allows developers to use long chain-of-thought reasoning with built-in web search. Sonar Reasoning is uncensored and hosted in US datacenters."
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
include_reasoning:
default: false
description: If the endpoint can return reasoning explicitly, setting this parameter will include reasoning tokens in the response (available in a separate field).
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: sorcererlm-8x22b
model_provider: raifle
inference_provider:
provider: openrouter
model_name: raifle/sorcererlm-8x22b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 4.5
per_output_token: 4.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 16000
description: |-
SorcererLM is an advanced RP and storytelling model, built as a low-rank 16-bit LoRA fine-tune of [WizardLM-2 8x22B](/microsoft/wizardlm-2-8x22b).
- Advanced reasoning and emotional intelligence for engaging and immersive interactions
- Vivid writing capabilities enriched with spatial and contextual awareness
- Enhanced narrative depth, promoting creative and dynamic storytelling
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: toppy-m-7b
model_provider: undi95
inference_provider:
provider: openrouter
model_name: undi95/toppy-m-7b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.07
per_output_token: 0.07
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
A wild 7B parameter model that merges several models using the new task_arithmetic merge method from mergekit.
List of merged models:
- NousResearch/Nous-Capybara-7B-V1.9
- [HuggingFaceH4/zephyr-7b-beta](/models/huggingfaceh4/zephyr-7b-beta)
- lemonilia/AshhLimaRP-Mistral-7B
- Vulkane/120-Days-of-Sodom-LoRA-Mistral-7b
- Undi95/Mistral-pippa-sharegpt-7b-qlora
#merge #uncensored
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: toppy-m-7b:free
model_provider: undi95
inference_provider:
provider: openrouter
model_name: undi95/toppy-m-7b:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
A wild 7B parameter model that merges several models using the new task_arithmetic merge method from mergekit.
List of merged models:
- NousResearch/Nous-Capybara-7B-V1.9
- [HuggingFaceH4/zephyr-7b-beta](/models/huggingfaceh4/zephyr-7b-beta)
- lemonilia/AshhLimaRP-Mistral-7B
- Vulkane/120-Days-of-Sodom-LoRA-Mistral-7b
- Undi95/Mistral-pippa-sharegpt-7b-qlora
#merge #uncensored
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: toppy-m-7b:nitro
model_provider: undi95
inference_provider:
provider: openrouter
model_name: undi95/toppy-m-7b:nitro
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.07
per_output_token: 0.07
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: |-
A wild 7B parameter model that merges several models using the new task_arithmetic merge method from mergekit.
List of merged models:
- NousResearch/Nous-Capybara-7B-V1.9
- [HuggingFaceH4/zephyr-7b-beta](/models/huggingfaceh4/zephyr-7b-beta)
- lemonilia/AshhLimaRP-Mistral-7B
- Vulkane/120-Days-of-Sodom-LoRA-Mistral-7b
- Undi95/Mistral-pippa-sharegpt-7b-qlora
#merge #uncensored
parameters: null
- model: unslopnemo-12b
model_provider: thedrummer
inference_provider:
provider: openrouter
model_name: thedrummer/unslopnemo-12b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 0.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32000
description: UnslopNemo v4.1 is the latest addition from the creator of Rocinante, designed for adventure writing and role-play scenarios.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: weaver
model_provider: mancer
inference_provider:
provider: openrouter
model_name: mancer/weaver
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 1.5
per_output_token: 2.25
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8000
description: An attempt to recreate Claude-style verbosity, but don't expect the same level of coherence or memory. Meant for use in roleplay/narrative situations.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: wizardlm-2-7b
model_provider: microsoft
inference_provider:
provider: openrouter
model_name: microsoft/wizardlm-2-7b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.07
per_output_token: 0.07
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32000
description: |-
WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest and achieves performance comparable to existing leading open-source models that are 10x larger.
It is a finetune of [Mistral 7B Instruct](/models/mistralai/mistral-7b-instruct), using the same technique as [WizardLM-2 8x22B](/models/microsoft/wizardlm-2-8x22b).
To read more about the model release, [click here](https://wizardlm.github.io/WizardLM2/).
#moe
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: wizardlm-2-8x22b
model_provider: microsoft
inference_provider:
provider: openrouter
model_name: microsoft/wizardlm-2-8x22b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.5
per_output_token: 0.5
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 65536
description: |-
WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to leading proprietary models, and it consistently outperforms all existing state-of-the-art open-source models.
It is an instruct finetune of [Mixtral 8x22B](/models/mistralai/mixtral-8x22b).
To read more about the model release, [click here](https://wizardlm.github.io/WizardLM2/).
#moe
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: xwin-lm-70b
model_provider: xwin-lm
inference_provider:
provider: openrouter
model_name: xwin-lm/xwin-lm-70b
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.75
per_output_token: 3.75
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 8192
description: Xwin-LM aims to develop and open-source alignment tech for LLMs. Our first release, built on the [Llama2](/models/${Model.Llama_2_13B_Chat}) base models, ranked TOP-1 on AlpacaEval. Notably, it's the first to surpass [GPT-4](/models/${Model.GPT_4}) on this benchmark. The project will be continuously updated.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
min_p:
default: 0.0
description: Represents the minimum probability for a token to be considered, relative to the most likely token. For example, a value of 0.1 means only tokens with at least 10% of the top token’s probability are allowed.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
seed:
default: null
description: If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed, and you should refer to the system_fingerprint response parameter to monitor changes in the backend.
max: null
min: null
required: false
step: 1
type: int
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_a:
default: 0.0
description: Consider only tokens with sufficiently high probabilities relative to the top token. A lower value focuses the selection on tokens near the top probability, acting like a dynamic Top-P filter.
max: 1.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: yi-large
model_provider: 01-ai
inference_provider:
provider: openrouter
model_name: 01-ai/yi-large
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 3.0
per_output_token: 3.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 32768
description: |-
The Yi Large model was designed by 01.AI with the following use cases in mind: knowledge search, data classification, human-like chatbots, and customer service.
It stands out for its multilingual proficiency, particularly in Spanish, Chinese, Japanese, German, and French.
Check out the [launch announcement](https://01-ai.github.io/blog/01.ai-yi-large-llm-launch) to learn more.
parameters:
frequency_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
max: 2
min: -2
required: false
step: 0.1
type: float
logit_bias:
default: {}
description: A JSON object mapping token IDs to bias values. These biases (typically between -100 and 100) are added to the logits before sampling, affecting token selection.
required: false
type: object
logprobs:
default: false
description: Whether to return log probabilities of the output tokens. If true, the response includes the log probability of each output token.
required: false
type: boolean
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
presence_penalty:
default: 0
description: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
max: 1.999
min: -2
required: false
step: 0.1
type: float
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
response_format:
default:
type: json_object
description: 'Forces the model to produce output in a specific format. For example, setting this to { ''type'': ''json_object'' } enables JSON mode, ensuring the response is valid JSON.'
required: false
type: object
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
structured_outputs:
default: false
description: If true, instructs the model to return structured outputs (e.g., in JSON format) using the response_format provided.
required: false
type: boolean
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_logprobs:
default: null
description: Specifies the number of most likely tokens (from 0 to 20) to return at each token position, each with its associated log probability. (Requires that logprobs is enabled.)
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float
- model: zephyr-7b-beta:free
model_provider: huggingfaceh4
inference_provider:
provider: openrouter
model_name: huggingfaceh4/zephyr-7b-beta:free
endpoint: https://openrouter.ai/api/v1
price:
per_input_token: 0.0
per_output_token: 0.0
valid_from: null
input_formats:
- text
output_formats:
- text
capabilities: []
type: completions
limits:
max_context_size: 4096
description: Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-β is the second model in the series, and is a fine-tuned version of [mistralai/Mistral-7B-v0.1](/models/mistralai/mistral-7b-instruct-v0.1) that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO).
parameters:
max_tokens:
default: 1000
description: The maximum number of tokens that can be generated in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
max: null
min: null
required: false
type: int
repetition_penalty:
default: 1.0
description: Helps reduce repetition in the output. Higher values (up to 2.0) make the model less likely to repeat tokens, whereas values closer to 0.0 encourage token reuse.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
stop:
default: null
description: Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
max: null
min: null
required: false
type: string/array
temperature:
default: 1.0
description: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
max: 2.0
min: 0.0
required: false
step: 0.1
type: float
top_k:
default: 0
description: Limits the token sampling to only the top K tokens. A value of 0 disables this setting, allowing the model to consider all tokens.
min: 0
required: false
step: 1
type: int
top_p:
default: 1
description: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
max: 1
min: 0
required: false
step: 0.05
type: float