Function for evaluating LLM responses and generating metrics.
The primary use case for evaluate_llm is to take a list of data samples, typically containing inputs and outputs
from LLM systems, and evaluate them against user-defined metrics in an LLM-as-a-judge pipeline. The user is expected
to provide a list of dict objects and a list of LLMEval metrics. These eval metrics are used to build a workflow, which
is then executed in an async context. All eval scores are extracted and returned to the user.
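Example (a minimal sketch of how evaluate_llm might be called; the dict keys, the LLMEval constructor arguments, and
the import path are illustrative assumptions, only evaluate_llm and LLMEval are taken from this documentation):

    # from your_package import evaluate_llm, LLMEval  # actual import path depends on the package layout

    # Data samples: each dict holds the input sent to the LLM system and the output it produced.
    samples = [
        {"input": "What is the capital of France?", "output": "The capital of France is Paris."},
        {"input": "Summarize Hamlet in one sentence.", "output": "A Danish prince avenges his father's murder at great cost."},
    ]

    # Hypothetical metric definitions; real LLMEval fields may differ.
    metrics = [
        LLMEval(name="correctness", prompt="Is the output factually correct given the input?"),
        LLMEval(name="conciseness", prompt="Is the output concise and on-topic?"),
    ]

    # evaluate_llm builds the judge workflow from the metrics, runs it asynchronously
    # over the samples, and returns the extracted eval scores.
    scores = evaluate_llm(samples, metrics)
    print(scores)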