Get the TRITONBACKEND API version supported by Triton. This value
can be compared against the TRITONBACKEND_API_VERSION_MAJOR and
TRITONBACKEND_API_VERSION_MINOR used to build the backend to
ensure that Triton is compatible with the backend.
Get the location of the files that make up the backend
implementation. This location contains the backend shared library
and any other files located with the shared library. The
‘location’ communicated depends on how the backend is being
communicated to Triton as indicated by ‘artifact_type’.
Add a preferred instance group for the backend. This function
can be called multiple times to cover the different instance group kinds
that the backend supports, in priority order: the first call describes
the most preferred group. In the case where instance groups are not
explicitly provided, Triton will use this attribute to create a model
deployment that better aligns with the backend’s preference.
Set whether or not the backend supports concurrently loading multiple
TRITONBACKEND_ModelInstances in a thread-safe manner.
Get the backend configuration. The ‘backend_config’ message is
owned by Triton and should not be modified or freed by the caller.
Get the execution policy for this backend. By default the
execution policy is TRITONBACKEND_EXECUTION_BLOCKING.
Get the memory manager associated with a backend.
Get the name of the backend. The caller does not own the returned
string and must not modify or delete it. The lifetime of the
returned string extends only as long as ‘backend’.
Set the execution policy for this backend. By default the
execution policy is TRITONBACKEND_EXECUTION_BLOCKING. Triton reads
the backend’s execution policy after calling
TRITONBACKEND_Initialize, so to be recognized, changes to the
execution policy must be made within TRITONBACKEND_Initialize.
Also note that if the model uses the sequence batcher, Triton will
use the TRITONBACKEND_EXECUTION_BLOCKING policy irrespective of the
policy specified by this setter function.
Set the user-specified state associated with the backend. The
state is completely owned and managed by the backend.
Get the user-specified state associated with the backend. The
state is completely owned and managed by the backend.
Finalize for a backend. This function is optional, a backend is
not required to implement it. This function is called once, just
before the backend is unloaded. All state associated with the
backend should be freed and any threads created for the backend
should be exited/joined before returning from this function.
Query the backend for different model attributes. This function is optional,
a backend is not required to implement it. The backend is also not required
to set all of the backend attributes listed. This function is called when
Triton requires further backend / model information to perform operations.
This function may be called multiple times within the lifetime of the
backend (between TRITONBACKEND_Initialize and TRITONBACKEND_Finalize).
The backend may return an error to indicate failure to set the backend
attributes, in which case the attributes specified in the same function call
will be ignored. Triton will update the specified attributes if ‘nullptr’ is
returned.
Get all information about an output tensor by its index. The caller does
not own any of the referenced return values and must not modify or delete
them. The lifetime of all returned values extends until ‘response’ is
deleted.
Get all information about an output tensor by its name. The caller does
not own any of the referenced return values and must not modify or delete
them. The lifetime of all returned values extends until ‘response’ is
deleted.
Initialize a backend. This function is optional, a backend is not
required to implement it. This function is called once when a
backend is loaded to allow the backend to initialize any state
associated with the backend. A backend has a single state that is
shared across all models that use the backend.
Get a buffer holding (part of) the tensor data for an input. For a
given input the number of buffers composing the input is found
from ‘buffer_count’ returned by TRITONBACKEND_InputProperties. The
returned buffer is owned by the input and so should not be
modified or freed by the caller. The lifetime of the buffer
matches that of the input and so the buffer should not be accessed
after the input tensor object is released.
Get the buffer attributes associated with the given input buffer. For a
given input the number of buffers composing the input is found from
‘buffer_count’ returned by TRITONBACKEND_InputProperties. The returned
‘buffer_attributes’ is owned by the input and so should not be modified or
freed by the caller. The lifetime of the ‘buffer_attributes’ matches that of
the input and so the ‘buffer_attributes’ should not be accessed after the
input tensor object is released.
Get a buffer holding (part of) the tensor data for an input for a specific
host policy. If there are no input buffers specified for this host policy,
the fallback input buffer is returned.
For a given input the number of buffers composing the input is found
from ‘buffer_count’ returned by TRITONBACKEND_InputPropertiesForHostPolicy.
The returned buffer is owned by the input and so should not be modified or
freed by the caller. The lifetime of the buffer matches that of the input
and so the buffer should not be accessed after the input tensor object is
released.
Get the name and properties of an input tensor. The returned
strings and other properties are owned by the input, not the
caller, and so should not be modified or freed.
Get the name and properties of an input tensor associated with a given
host policy. If there are no input buffers for the specified host policy,
the properties of the fallback input buffers are returned. The returned
strings and other properties are owned by the input, not the caller, and so
should not be modified or freed.
Allocate a contiguous block of memory of a specific type using a
memory manager. Two error codes have specific interpretations for
this function:
Free a buffer that was previously allocated with
TRITONBACKEND_MemoryManagerAllocate. The call must provide the
same values for ‘memory_type’ and ‘memory_type_id’ as were used
when the buffer was allocated or else the behavior is undefined.
Whether the backend should attempt to auto-complete the model configuration.
If true, the model should fill the inputs, outputs, and max batch size in
the model configuration if incomplete. If the model configuration is
changed, the new configuration must be reported to Triton using
TRITONBACKEND_ModelSetConfig.
Get the backend used by the model.
Callback to be invoked when Triton has finished forming a batch.
Check whether a request should be added to the pending model batch.
\param userp The placeholder for backend to store and retrieve information
about this pending batch.
\return a TRITONSERVER_Error indicating success or failure.
Free memory associated with batcher. This is called during model unloading.
Create a new batcher for use with custom batching. This is called during
model loading. The batcher will point to a user-defined data structure that
holds read-only data used for custom batching.
Get the model configuration. The caller takes ownership of the
message object and must call TRITONSERVER_MessageDelete to release
the object. The configuration is available via this call even
before the model is loaded and so can be used in
TRITONBACKEND_ModelInitialize. TRITONSERVER_ServerModelConfig
returns equivalent information but is not usable until after the
model loads.
Finalize for a model. This function is optional, a backend is not
required to implement it. This function is called once for a
model, just before the model is unloaded from Triton. All state
associated with the model should be freed and any threads created
for the model should be exited/joined before returning from this
function.
Initialize for a model. This function is optional, a backend is
not required to implement it. This function is called once when a
model that uses the backend is loaded to allow the backend to
initialize any state associated with the model. The backend should
also examine the model configuration to determine if the
configuration is suitable for the backend. Any errors reported by
this function will prevent the model from loading.
Get the device ID of the model instance.
Execute a batch of one or more requests on a model instance. This
function is required. Triton will not perform multiple
simultaneous calls to this function for a given model ‘instance’;
however, there may be simultaneous calls for different model
instances (for the same or different models).
Finalize for a model instance. This function is optional, a
backend is not required to implement it. This function is called
once for an instance, just before the corresponding model is
unloaded from Triton. All state associated with the instance
should be freed and any threads created for the instance should be
exited/joined before returning from this function.
Get the host policy setting. The ‘host_policy’ message is
owned by Triton and should not be modified or freed by the caller.
Initialize for a model instance. This function is optional, a
backend is not required to implement it. This function is called
once when a model instance is created to allow the backend to
initialize any state associated with the instance.
Whether the model instance is passive.
Get the kind of the model instance.
Get the model associated with a model instance.
Get the name of the model instance. The returned string is owned by the
model object, not the caller, and so should not be modified or
freed.
Get the number of optimization profiles to be loaded for the instance.
Get the name of an optimization profile. The caller does not own
the returned string and must not modify or delete it. The lifetime
of the returned string extends only as long as ‘instance’.
Record statistics for the execution of an entire batch of
inference requests.
Report the memory usage of the model instance that will be released on
TRITONBACKEND_ModelInstanceFinalize. The backend may call this function
within the lifecycle of the TRITONBACKEND_ModelInstance object (between
TRITONBACKEND_ModelInstanceInitialize and
TRITONBACKEND_ModelInstanceFinalize) to report the latest usage. To report
the memory usage of the model, see TRITONBACKEND_ModelReportMemoryUsage.
Record statistics for an inference request.
Get the number of secondary devices configured for the instance.
Get the properties of an indexed secondary device. The returned
strings and other properties are owned by the instance, not the
caller, and so should not be modified or freed.
Set the user-specified state associated with the model
instance. The state is completely owned and managed by the
backend.
Get the user-specified state associated with the model
instance. The state is completely owned and managed by the
backend.
Get the name of the model. The returned string is owned by the
model object, not the caller, and so should not be modified or
freed.
Report the memory usage of the model that will be released on
TRITONBACKEND_ModelFinalize. The backend may call this function within the
lifecycle of the TRITONBACKEND_Model object (between
TRITONBACKEND_ModelInitialize and TRITONBACKEND_ModelFinalize) to report the
latest usage. To report the memory usage of a model instance,
see TRITONBACKEND_ModelInstanceReportMemoryUsage.
Get the location of the files that make up the model. The
‘location’ communicated depends on how the model is being
communicated to Triton as indicated by ‘artifact_type’.
Get the TRITONSERVER_Server object that this model is being served
by.
Set the model configuration in Triton server. This API should only be called
when the backend implements auto-completion of the model configuration
and TRITONBACKEND_ModelAutoCompleteConfig returns true in
auto_complete_config. Only the inputs, outputs, max batch size, and
scheduling choice can be changed; a caveat is that the scheduling choice can
only be changed if none was previously set. Any other changes to the model
configuration will be ignored by Triton. This function can only be called
from TRITONBACKEND_ModelInitialize; calling it in any other context will
result in an error being returned. Additionally, Triton server can add some
of the missing fields in the provided config with this call. The backend
must get the complete configuration again by using TRITONBACKEND_ModelConfig.
TRITONBACKEND_ModelSetConfig does not take ownership of the message object
and so the caller should call TRITONSERVER_MessageDelete to release the
object once the function returns.
Set the user-specified state associated with the model. The
state is completely owned and managed by the backend.
Get the user-specified state associated with the model. The
state is completely owned and managed by the backend.
Get the version of the model.
Get a buffer to use to hold the tensor data for the output. The
returned buffer is owned by the output and so should not be freed
by the caller. The caller can and should fill the buffer with the
output data for the tensor. The lifetime of the buffer matches
that of the output and so the buffer should not be accessed after
the output tensor object is released.
Get the buffer attributes associated with the given output buffer. The
returned ‘buffer_attributes’ is owned by the output and so should not be
modified or freed by the caller. The lifetime of the ‘buffer_attributes’
matches that of the output and so the ‘buffer_attributes’ should not be
accessed after the output tensor object is released. This function must be
called after TRITONBACKEND_OutputBuffer, otherwise it might contain
incorrect data.
Get the correlation ID of the request if it is an unsigned integer.
Zero indicates that the request does not have a correlation ID.
Returns a failure if the correlation ID for the given request is not an
unsigned integer.
Get the correlation ID of the request if it is a string.
An empty string indicates that the request does not have a correlation ID.
Returns an error if the correlation ID for the given request is not a string.
Get the flag(s) associated with a request. On return ‘flags’ holds
a bitwise-or of all flag values, see TRITONSERVER_RequestFlag for
available flags.
Get the ID of the request. Can be nullptr if request doesn’t have
an ID. The returned string is owned by the request, not the
caller, and so should not be modified or freed.
Get a named request input. The lifetime of the returned input
object matches that of the request and so the input object should
not be accessed after the request object is released.
Get a request input by index. The order of inputs in a given
request is not necessarily consistent with other requests, even if
the requests are in the same batch. As a result, you cannot
assume that an index obtained from one request will point to the
same input in a different request.
Get the number of input tensors specified in the request.
Get the name of an input tensor. The caller does not own
the returned string and must not modify or delete it. The lifetime
of the returned string extends only as long as ‘request’.
Query whether the request is cancelled or not.
Returns the preferred memory type and memory type ID of the output buffer
for the request. As much as possible, Triton will attempt to return
the same memory_type and memory_type_id values that will be returned by
the subsequent call to TRITONBACKEND_OutputBuffer, however, the backend must
be capable of handling cases where the values differ.
Get the number of output tensors requested to be returned in the
request.
Get the name of a requested output tensor. The caller does not own
the returned string and must not modify or delete it. The lifetime
of the returned string extends only as long as ‘request’.
Get a request parameter by index. The order of parameters in a given
request is not necessarily consistent with other requests, even if
the requests are in the same batch. As a result, you cannot
assume that an index obtained from one request will point to the
same parameter in a different request.
Get the number of parameters specified in the inference request.
Release the request. The request should be released when it is no
longer needed by the backend. If this call returns with an error
(i.e. non-nullptr) then the request was not released and ownership
remains with the backend. If this call returns with success, the
‘request’ object is no longer owned by the backend and must not be
used. Any tensor names, data types, shapes, input tensors,
etc. returned by TRITONBACKEND_Request* functions for this request
are no longer valid. If a persistent copy of that data is required
it must be created before calling this function.
Get the trace associated with a request. The returned trace is owned by the
request, not the caller, and so should not be modified or freed.
If the request is not being traced, then nullptr
will be returned.
Destroy a response. It is not necessary to delete a response if
TRITONBACKEND_ResponseSend is called as that function transfers
ownership of the response object to Triton.
Destroy a response factory.
Query whether the response factory is cancelled or not.
Create the response factory associated with a request.
Send response flags without a corresponding response.
Create a response for a request.
Create a response using a factory.
Create an output tensor in the response. The lifetime of the
returned output tensor object matches that of the response and so
the output tensor object should not be accessed after the response
object is deleted.
Send a response. Calling this function transfers ownership of the
response object to Triton. The caller must not access or delete
the response object after calling this function.
Set a boolean parameter in the response.
Set an integer parameter in the response.
Set a string parameter in the response.
Get a buffer to use to hold the tensor data for the state. The returned
buffer is owned by the state and so should not be freed by the caller. The
caller can and should fill the buffer with the state data. The buffer must
not be accessed by the backend after TRITONBACKEND_StateUpdate is called.
The caller should fill the buffer before calling TRITONBACKEND_StateUpdate.
Get the buffer attributes associated with the given state buffer.
The returned ‘buffer_attributes’ is owned by the state and so should not be
modified or freed by the caller. The lifetime of the ‘buffer_attributes’
matches that of the state.
Create a state in the request. The returned state object is only valid
before TRITONBACKEND_StateUpdate is called. The state should not be
freed by the caller. If TRITONBACKEND_StateUpdate is not called, the
lifetime of the state matches the lifetime of the request. If the state name
does not exist in the “state” section of the model configuration, the state
will not be created and an error will be returned. If this function is
called when sequence batching is not enabled or there is no ‘states’ section
in the sequence batching section of the model configuration, this call will
return an error.
Update the state for the sequence. Calling this function will replace the
state stored for this sequence in Triton with ‘state’ provided in the
function argument. If this function is called when sequence batching is not
enabled or there is no ‘states’ section in the sequence batching section of
the model configuration, this call will return an error. The backend is not
required to call this function. If the backend doesn’t call the
TRITONBACKEND_StateUpdate function, this particular state for the sequence
will not be updated and the next inference request in the sequence will use
the same state as the current inference request.
Get the TRITONSERVER API version supported by the Triton shared
library. This value can be compared against the
TRITONSERVER_API_VERSION_MAJOR and TRITONSERVER_API_VERSION_MINOR
used to build the client to ensure that the Triton shared library is
compatible with the client.
Get the byte size field of the buffer attributes.
Get the CudaIpcHandle field of the buffer attributes object.
Delete a buffer attributes object.
Get the memory type field of the buffer attributes.
Get the memory type id field of the buffer attributes.
Create a new buffer attributes object. The caller takes ownership of
the TRITONSERVER_BufferAttributes object and must call
TRITONSERVER_BufferAttributesDelete to release the object.
Set the byte size field of the buffer attributes.
Set the CudaIpcHandle field of the buffer attributes.
Set the memory type field of the buffer attributes.
Set the memory type id field of the buffer attributes.
Get the size of a Triton datatype in bytes. Zero is returned for
TRITONSERVER_TYPE_BYTES because it has a variable size. Zero is also
returned for TRITONSERVER_TYPE_INVALID.
Get the string representation of a data type. The returned string
is not owned by the caller and so should not be modified or freed.
Get the error code.
Get the string representation of an error code. The returned
string is not owned by the caller and so should not be modified or
freed. The lifetime of the returned string extends only as long as
‘error’ and must not be accessed once ‘error’ is deleted.
Delete an error object.
Get the error message. The returned string is not owned by the
caller and so should not be modified or freed. The lifetime of the
returned string extends only as long as ‘error’ and must not be
accessed once ‘error’ is deleted.
Create a new error object. The caller takes ownership of the
TRITONSERVER_Error object and must call TRITONSERVER_ErrorDelete to
release the object.
Get the TRITONSERVER_MetricKind of a metric and its corresponding family.
Add an input to a request.
Add a raw input to a request. The name recognized by the model, the data
type, and the shape of the input will be deduced from the model
configuration. This function must be called at most once on a request that
has no other inputs, to ensure the deduction is accurate.
Add an output request to an inference request.
Assign a buffer of data to an input. The buffer will be appended
to any existing buffers for that input. The ‘inference_request’
object takes ownership of the buffer and so the caller should not
modify or free the buffer until that ownership is released by
‘inference_request’ being deleted or by the input being removed
from ‘inference_request’.
Assign a buffer of data to an input. The buffer will be appended
to any existing buffers for that input. The ‘inference_request’
object takes ownership of the buffer and so the caller should not
modify or free the buffer until that ownership is released by
‘inference_request’ being deleted or by the input being removed
from ‘inference_request’.
Assign a buffer of data to an input for execution on all model instances
with the specified host policy. The buffer will be appended to any existing
buffers for that input on all devices with this host policy. The
‘inference_request’ object takes ownership of the buffer and so the caller
should not modify or free the buffer until that ownership is released by
‘inference_request’ being deleted or by the input being removed from
‘inference_request’. If the execution is scheduled on a device that does not
have an input buffer specified using this function, then the input buffer
specified with TRITONSERVER_InferenceRequestAppendInputData will be used, so
a non-host-policy-specific version of the data must be added using that API.
\param inference_request The request object.
\param name The name of the input.
\param base The base address of the input data.
\param byte_size The size, in bytes, of the input data.
\param memory_type The memory type of the input data.
\param memory_type_id The memory type id of the input data.
\param host_policy_name All model instances executing with this host_policy
will use this input buffer for execution.
\return a TRITONSERVER_Error indicating success or failure.
Cancel an inference request. Requests are cancelled on a best-effort
basis and no guarantee is provided that cancelling a
request will result in early termination. Note that the
inference request cancellation status will be reset after
TRITONSERVER_InferAsync is run. This means that if you cancel
the request before calling TRITONSERVER_InferAsync,
the request will not be cancelled.
Get the correlation ID of the inference request as an unsigned integer.
The default is 0, which indicates that the request has no correlation ID.
If the correlation ID associated with the inference request is a string,
this function will return a failure. The correlation ID is used
to indicate that two or more inference requests are related to each other.
How this relationship is handled by the inference server is determined by
the model’s scheduling policy.
Get the correlation ID of the inference request as a string.
The default is the empty string “”, which indicates that the request has no
correlation ID. If the correlation ID associated with the inference request
is an unsigned integer, then this function will return a failure. The
correlation ID is used to indicate that two or more inference requests are
related to each other. How this relationship is handled by the inference
server is determined by the model’s scheduling policy.
Delete an inference request object.
Get the flag(s) associated with a request. On return ‘flags’ holds
a bitwise-or of all flag values, see TRITONSERVER_RequestFlag for
available flags.
Get the ID for a request. The returned ID is owned by
‘inference_request’ and must not be modified or freed by the
caller.
Query whether the request is cancelled or not.
Create a new inference request object.
Deprecated. See TRITONSERVER_InferenceRequestPriorityUInt64 instead.
Get the priority for a request. The default is 0 indicating that
the request does not specify a priority and so will use the
model’s default priority.
Clear all input data from an input, releasing ownership of the
buffer(s) that were appended to the input with
TRITONSERVER_InferenceRequestAppendInputData or
TRITONSERVER_InferenceRequestAppendInputDataWithHostPolicy.
\param inference_request The request object.
\param name The name of the input.
Remove all inputs from a request.
Remove all output requests from an inference request.
Remove an input from a request.
Remove an output request from an inference request.
Set a boolean parameter in the request.
Set the correlation ID of the inference request to be an unsigned integer.
The default is 0, which indicates that the request has no correlation ID.
The correlation ID is used to indicate that two or more inference requests
are related to each other. How this relationship is handled by the
inference server is determined by the model’s scheduling policy.
Set the correlation ID of the inference request to be a string.
The correlation ID is used to indicate that two or more inference
requests are related to each other. How this relationship is
handled by the inference server is determined by the model’s
scheduling policy.
Set the flag(s) associated with a request. ‘flags’ should hold a
bitwise-or of all flag values, see TRITONSERVER_RequestFlag for
available flags.
Set the ID for a request.
Set an integer parameter in the request.
Deprecated. See TRITONSERVER_InferenceRequestSetPriorityUInt64 instead.
Set the priority for a request. The default is 0 indicating that
the request does not specify a priority and so will use the
model’s default priority.
Set the release callback for an inference request. The release
callback is called by Triton to return ownership of the request
object.
Set the allocator and response callback for an inference
request. The allocator is used to allocate buffers for any output
tensors included in responses that are produced for this
request. The response callback is called to return response
objects representing responses produced for this request.
Set a string parameter in the request.
Set the timeout for a request, in microseconds. The default is 0
which indicates that the request has no timeout.
Get the timeout for a request, in microseconds. The default is 0
which indicates that the request has no timeout.
Delete an inference response object.
Return the error status of an inference response. Returns a
TRITONSERVER_Error object on failure and nullptr on success.
The returned error object is owned by ‘inference_response’ and so
should not be deleted by the caller.
Get the ID of the request corresponding to a response. The caller
does not own the returned ID and must not modify or delete it. The
lifetime of all returned values extends until ‘inference_response’
is deleted.
Get model used to produce a response. The caller does not own the
returned model name value and must not modify or delete it. The
lifetime of all returned values extends until ‘inference_response’
is deleted.
Get all information about an output tensor. The tensor data is
returned as the base pointer to the data and the size, in bytes,
of the data. The caller does not own any of the returned values
and must not modify or delete them. The lifetime of all returned
values extends until ‘inference_response’ is deleted.
Get a classification label associated with an output for a given
index. The caller does not own the returned label and must not
modify or delete it. The lifetime of the returned label extends
until ‘inference_response’ is deleted.
Get the number of outputs available in the response.
Get all information about a parameter. The caller does not own any
of the returned values and must not modify or delete them. The
lifetime of all returned values extends until ‘inference_response’
is deleted.
Get the number of parameters available in the response.
Get the string representation of a trace activity. The returned
string is not owned by the caller and so should not be modified or
freed.
Delete a trace object.
Get the id associated with a trace. Every trace is assigned an id
that is unique across all traces created for a Triton server.
Get the string representation of a trace level. The returned
string is not owned by the caller and so should not be modified or
freed.
Get the name of the model associated with a trace. The caller does
not own the returned string and must not modify or delete it. The
lifetime of the returned string extends only as long as ‘trace’.
Get the version of the model associated with a trace.
Create a new inference trace object. The caller takes ownership of
the TRITONSERVER_InferenceTrace object and must call
TRITONSERVER_InferenceTraceDelete to release the object.
Get the parent id associated with a trace. The parent id indicates
a parent-child relationship between two traces. A parent id value
of 0 indicates that there is no parent trace.
Get the request id associated with a trace. The caller does
not own the returned string and must not modify or delete it. The
lifetime of the returned string extends only as long as ‘trace’.
Get the child trace, spawned from the parent trace. The caller owns
the returned object and must call TRITONSERVER_InferenceTraceDelete
to release the object, unless ownership is transferred through
other APIs (see TRITONSERVER_ServerInferAsync).
Create a new inference trace object. The caller takes ownership of
the TRITONSERVER_InferenceTrace object and must call
TRITONSERVER_InferenceTraceDelete to release the object.
Get the string representation of an instance-group kind. The
returned string is not owned by the caller and so should not be
modified or freed.
Is a log level enabled?
Log a message at a given log level if that level is enabled.
Get the string representation of a memory type. The returned
string is not owned by the caller and so should not be modified or
freed.
Delete a message object.
Create a new message object from a serialized JSON string.
Get the base and size of the buffer containing the serialized
message in JSON format. The buffer is owned by the
TRITONSERVER_Message object and should not be modified or freed by
the caller. The lifetime of the buffer extends only as long as
‘message’ and must not be accessed once ‘message’ is deleted.
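A sketch of the message round trip, assuming linkage against libtritonserver; the JSON content is illustrative:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#include "tritonserver.h"

static void
MessageRoundTrip(void)
{
  const char* json = "{\"name\":\"example_model\",\"version\":\"1\"}";
  TRITONSERVER_Message* message = NULL;
  TRITONSERVER_Error* err = TRITONSERVER_MessageNewFromSerializedJson(
      &message, json, strlen(json));
  if (err == NULL) {
    const char* base = NULL;
    size_t byte_size = 0;
    /* 'base' points into memory owned by 'message'; do not free it and
     * do not access it after TRITONSERVER_MessageDelete. */
    err = TRITONSERVER_MessageSerializeToJson(message, &base, &byte_size);
    if (err == NULL) {
      printf("%.*s\n", (int)byte_size, base);
    }
    TRITONSERVER_MessageDelete(message);
  }
  if (err != NULL) {
    TRITONSERVER_ErrorDelete(err);
  }
}
```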
Delete a metric object.
All TRITONSERVER_Metric* objects should be deleted BEFORE their
corresponding TRITONSERVER_MetricFamily* objects have been deleted.
If a family is deleted before its metrics, an error will be returned.
Delete a metric family object.
A TRITONSERVER_MetricFamily* object should be deleted AFTER its
corresponding TRITONSERVER_Metric* objects have been deleted.
Attempting to delete a family before its metrics will return an error.
Create a new metric family object. The caller takes ownership of the
TRITONSERVER_MetricFamily object and must call
TRITONSERVER_MetricFamilyDelete to release the object.
Increment the current value of metric by value.
Supports metrics of kind TRITONSERVER_METRIC_KIND_GAUGE for any value,
and TRITONSERVER_METRIC_KIND_COUNTER for non-negative values. Returns
TRITONSERVER_ERROR_UNSUPPORTED for unsupported TRITONSERVER_MetricKind
and TRITONSERVER_ERROR_INVALID_ARG for negative values on a
TRITONSERVER_METRIC_KIND_COUNTER metric.
Create a new metric object. The caller takes ownership of the
TRITONSERVER_Metric object and must call
TRITONSERVER_MetricDelete to release the object. The caller is also
responsible for ownership of the labels passed in. Each label can be deleted
immediately after creating the metric with TRITONSERVER_ParameterDelete
if not re-using the labels.
Set the current value of metric to value.
Supports metrics of kind TRITONSERVER_METRIC_KIND_GAUGE and returns
TRITONSERVER_ERROR_UNSUPPORTED for unsupported TRITONSERVER_MetricKind.
Get the current value of a metric object.
Supports metrics of kind TRITONSERVER_METRIC_KIND_COUNTER
and TRITONSERVER_METRIC_KIND_GAUGE, and returns
TRITONSERVER_ERROR_UNSUPPORTED for unsupported TRITONSERVER_MetricKind.
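The metric lifecycle and the required deletion order can be sketched as follows (error handling omitted for brevity; the family name and description are illustrative, and linkage against libtritonserver is assumed):

```c
#include "tritonserver.h"

static void
MetricLifecycle(void)
{
  TRITONSERVER_MetricFamily* family = NULL;
  TRITONSERVER_Metric* metric = NULL;
  double value = 0;

  /* Create the family first, then metrics belonging to it. */
  TRITONSERVER_MetricFamilyNew(
      &family, TRITONSERVER_METRIC_KIND_COUNTER, "example_requests_total",
      "Example request counter (illustrative)");
  TRITONSERVER_MetricNew(
      &metric, family, NULL /* labels */, 0 /* label_count */);

  /* Counters accept only non-negative increments; negative values
   * return TRITONSERVER_ERROR_INVALID_ARG. */
  TRITONSERVER_MetricIncrement(metric, 1);
  TRITONSERVER_MetricValue(metric, &value);

  /* Deletion order matters: metrics BEFORE their family. */
  TRITONSERVER_MetricDelete(metric);
  TRITONSERVER_MetricFamilyDelete(family);
}
```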
Delete a metrics object.
Get a buffer containing the metrics in the specified format. For
each format the buffer contains the following:
Create a new parameter object with type TRITONSERVER_PARAMETER_BYTES.
The caller takes ownership of the TRITONSERVER_Parameter object and must
call TRITONSERVER_ParameterDelete to release the object. The object only
maintains a shallow copy of the ‘byte_ptr’ so the data content must be
valid until the parameter object is deleted.
Delete a parameter object.
Create a new parameter object. The caller takes ownership of the
TRITONSERVER_Parameter object and must call TRITONSERVER_ParameterDelete to
release the object. The object will maintain its own copy of the ‘value’.
Get the string representation of a parameter type. The returned
string is not owned by the caller and so should not be modified or
freed.
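The ownership difference between the two creation functions can be sketched as below; the parameter names and data are illustrative, and linkage against libtritonserver is assumed:

```c
#include "tritonserver.h"

static void
ParameterLifecycle(void)
{
  /* String parameter: the object keeps its own copy of 'value'. */
  TRITONSERVER_Parameter* str_param = TRITONSERVER_ParameterNew(
      "config", TRITONSERVER_PARAMETER_STRING, "{}");

  /* Bytes parameter: only a shallow copy of the pointer is kept, so
   * 'data' must remain valid until the parameter is deleted. */
  static const char data[] = {0x01, 0x02, 0x03};
  TRITONSERVER_Parameter* bytes_param =
      TRITONSERVER_ParameterBytesNew("blob", data, sizeof(data));

  TRITONSERVER_ParameterDelete(str_param);
  TRITONSERVER_ParameterDelete(bytes_param);
}
```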
Delete a response allocator.
Create a new response allocator object.
Set the buffer attributes function for a response allocator object.
The function will be called after alloc_fn to set the buffer attributes
associated with the output buffer.
Set the query function for a response allocator object. The
function is typically called before alloc_fn to determine the
allocator’s preferred memory type and memory type ID for the
current situation, allowing different execution decisions to be made.
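A minimal CPU-only allocator can be sketched as below, assuming the callback signatures from tritonserver.h and linkage against libtritonserver; the callback names are illustrative:

```c
#include <stdlib.h>

#include "tritonserver.h"

/* Allocation callback: a simple malloc-based allocator that always
 * places output buffers in CPU memory, whatever was requested. */
static TRITONSERVER_Error*
ResponseAlloc(
    TRITONSERVER_ResponseAllocator* allocator, const char* tensor_name,
    size_t byte_size, TRITONSERVER_MemoryType memory_type,
    int64_t memory_type_id, void* userp, void** buffer, void** buffer_userp,
    TRITONSERVER_MemoryType* actual_memory_type,
    int64_t* actual_memory_type_id)
{
  *buffer = (byte_size > 0) ? malloc(byte_size) : NULL;
  *buffer_userp = NULL;
  *actual_memory_type = TRITONSERVER_MEMORY_CPU;
  *actual_memory_type_id = 0;
  return NULL; /* success */
}

/* Release callback: frees the buffer allocated above. */
static TRITONSERVER_Error*
ResponseRelease(
    TRITONSERVER_ResponseAllocator* allocator, void* buffer,
    void* buffer_userp, size_t byte_size,
    TRITONSERVER_MemoryType memory_type, int64_t memory_type_id)
{
  free(buffer);
  return NULL; /* success */
}

static TRITONSERVER_Error*
CreateAllocator(TRITONSERVER_ResponseAllocator** allocator)
{
  return TRITONSERVER_ResponseAllocatorNew(
      allocator, ResponseAlloc, ResponseRelease, NULL /* start_fn */);
}
```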
Delete a server object. If server is not already stopped it is
stopped before being deleted.
Perform inference using the meta-data and inputs supplied by the
‘inference_request’. If the function returns success, then the
caller releases ownership of ‘inference_request’ and must not
access it in any way after this call, until ownership is returned
via the ‘request_release_fn’ callback registered in the request
object with TRITONSERVER_InferenceRequestSetReleaseCallback.
Is the server live?
Is the server ready?
Load the requested model or reload the model if it is already
loaded. The function does not return until the model is loaded or
fails to load. The returned error indicates whether or not the
model loaded successfully.
Load the requested model or reload the model if it is already
loaded, with the provided load parameters. The function does not return
until the model is loaded or fails to load. The returned error indicates
whether or not the model loaded successfully.
Currently the following parameter names are recognized:
Get the metadata of the server as a TRITONSERVER_Message object.
The caller takes ownership of the message object and must call
TRITONSERVER_MessageDelete to release the object.
Get the current metrics for the server. The caller takes ownership
of the metrics object and must call TRITONSERVER_MetricsDelete to
release the object.
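Retrieving and printing the metrics can be sketched as below, assuming linkage against libtritonserver and an existing ‘server’ handle:

```c
#include <stddef.h>
#include <stdio.h>

#include "tritonserver.h"

static void
PrintServerMetrics(TRITONSERVER_Server* server)
{
  TRITONSERVER_Metrics* metrics = NULL;
  TRITONSERVER_Error* err = TRITONSERVER_ServerMetrics(server, &metrics);
  if (err == NULL) {
    const char* base = NULL;
    size_t byte_size = 0;
    /* Prometheus text format; 'base' is owned by 'metrics'. */
    err = TRITONSERVER_MetricsFormatted(
        metrics, TRITONSERVER_METRIC_PROMETHEUS, &base, &byte_size);
    if (err == NULL) {
      fwrite(base, 1, byte_size, stdout);
    }
    TRITONSERVER_MetricsDelete(metrics);
  }
  if (err != NULL) {
    TRITONSERVER_ErrorDelete(err);
  }
}
```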
Get the batch properties of the model. The properties are
communicated by a flags value and an (optional) object returned by
‘voidp’.
Get the configuration of a model as a TRITONSERVER_Message object.
The caller takes ownership of the message object and must call
TRITONSERVER_MessageDelete to release the object.
Get the index of all unique models in the model repositories as a
TRITONSERVER_Message object. The caller takes ownership of the
message object and must call TRITONSERVER_MessageDelete to release
the object.
Is the model ready?
Get the metadata of a model as a TRITONSERVER_Message
object. The caller takes ownership of the message object and must
call TRITONSERVER_MessageDelete to release the object.
Get the statistics of a model as a TRITONSERVER_Message
object. The caller takes ownership of the object and must call
TRITONSERVER_MessageDelete to release the object.
Get the transaction policy of the model. The policy is
communicated by a flags value.
Create a new server object. The caller takes ownership of the
TRITONSERVER_Server object and must call TRITONSERVER_ServerDelete
to release the object.
Add resource count for rate limiting.
Delete a server options object.
Create a new server options object. The caller takes ownership of
the TRITONSERVER_ServerOptions object and must call
TRITONSERVER_ServerOptionsDelete to release the object.
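The options-then-server lifecycle can be sketched as below (error handling omitted for brevity; the repository path is a hypothetical placeholder, and linkage against libtritonserver is assumed):

```c
#include "tritonserver.h"

static void
ServerLifecycle(void)
{
  TRITONSERVER_ServerOptions* options = NULL;
  TRITONSERVER_Server* server = NULL;

  TRITONSERVER_ServerOptionsNew(&options);
  TRITONSERVER_ServerOptionsSetModelRepositoryPath(
      options, "/models" /* hypothetical path */);
  TRITONSERVER_ServerOptionsSetLogVerbose(options, 0);

  /* The server copies what it needs from the options, so the options
   * object can be deleted once the server is created. */
  TRITONSERVER_ServerNew(&server, options);
  TRITONSERVER_ServerOptionsDelete(options);

  /* ... serve requests ... */

  /* A stopped server cannot be restarted; delete also stops it. */
  TRITONSERVER_ServerStop(server);
  TRITONSERVER_ServerDelete(server);
}
```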
Set a configuration setting for a named backend in a server
options.
Set the directory containing backend shared libraries. This
directory is searched last after the version and model directory
in the model repository when looking for the backend shared
library for a model. If the backend is named ‘be’ the directory
searched is ‘backend_dir’/be/libtriton_be.so.
Set the number of threads used by the buffer manager in a server options.
Set the cache config that will be used to initialize the cache
implementation for “cache_name”.
Set the directory containing cache shared libraries. This
directory is searched when looking for cache implementations.
Enable or disable CPU metrics collection in a server options. CPU
metrics are collected if both this option and
TRITONSERVER_ServerOptionsSetMetrics are true.
Set the total CUDA memory byte size that the server can allocate
on a given GPU device in a server options. The CUDA memory pool
will be shared across Triton itself and the backends that use
TRITONBACKEND_MemoryManager to allocate memory.
Enable or disable exit-on-error in a server options.
Set the exit timeout, in seconds, for the server in a server
options.
Enable or disable GPU metrics collection in a server options. GPU
metrics are collected if both this option and
TRITONSERVER_ServerOptionsSetMetrics are true.
Set a host policy setting for a given policy name in a server options.
Enable or disable error level logging.
Provide a log output file.
Set the logging format.
Enable or disable info level logging.
Set verbose logging level. Level zero disables verbose logging.
Enable or disable warning level logging.
Enable or disable metrics collection in a server options.
Set a configuration setting for metrics in server options.
Set the interval for metrics collection in a server options.
This is 2000 milliseconds by default.
Set the minimum supported CUDA compute capability in a server
options.
Set the model control mode in a server options. For each mode the models
will be managed as follows:
Specify the limit on memory usage as a fraction on the device identified by
‘kind’ and ‘device_id’. If model loading on the device is requested and the
current memory usage exceeds the limit, the load will be rejected. If not
specified, the limit will not be set.
Set the number of threads to concurrently load models in a server options.
Enable model namespacing to allow serving models with the same name if
they are in different namespaces.
Set the model repository path in a server options. The path must be
the full absolute path to the model repository. This function can be called
multiple times with different paths to set multiple model repositories.
Note that if a model is not unique across all model repositories
at any time, the model will not be available.
Set the total pinned memory byte size that the server can allocate
in a server options. The pinned memory pool will be shared across
Triton itself and the backends that use
TRITONBACKEND_MemoryManager to allocate memory.
Set the rate limit mode in a server options.
Set the directory containing repository agent shared libraries. This
directory is searched when looking for the repository agent shared
library for a model. If the repo agent is named ‘ra’ the directory
searched is ‘repoagent_dir’/ra/libtritonrepoagent_ra.so.
Deprecated. See TRITONSERVER_ServerOptionsSetCacheConfig instead.
Set the textual ID for the server in a server options. The ID is a
name that identifies the server.
Set the model to be loaded at startup in a server options. The model must be
present in one, and only one, of the specified model repositories.
This function can be called multiple times with different model names
to set multiple startup models.
Note that it only takes effect in TRITONSERVER_MODEL_CONTROL_EXPLICIT mode.
Enable or disable strict model configuration handling in a server
options.
Enable or disable strict readiness handling in a server options.
Check the model repository for changes and update server state
based on those changes.
Register a new model repository. Not available in polling mode.
Stop a server object. A server can’t be restarted once it is
stopped.
Unload the requested model. Unloading a model that is not loaded
on the server has no effect and a success code will be returned.
The function does not wait for the requested model to be fully
unloaded; a success code will be returned as soon as the unload
is initiated. The returned error indicates whether or not the
unload was initiated successfully.
Unload the requested model, and also unload any dependent models that
were loaded along with the requested model (for example, the models
composing an ensemble). Unloading a model that is not loaded on the
server has no effect and a success code will be returned.
The function does not wait for the requested model and all dependent
models to be fully unloaded; a success code will be returned as soon
as the unload is initiated. The returned error indicates whether or
not the unload was initiated successfully.
Unregister a model repository. Not available in polling mode.
Get the Triton datatype corresponding to a string representation
of a datatype.