Agent evaluations

Agent evaluations test and measure AI agent quality before and after deployment. You get data on whether your agents deliver accurate, complete, and concise responses.

Evaluations test agents against predefined questions with known correct answers. You provide the questions and expected answers, and the system measures how well your agent performs.

Benefits of evaluations:

Verify agent performance before and after deployment
Get measurable ratings instead of subjective opinions
Track quality changes as you update knowledge bases and prompts
Maintain audit trails for regulatory requirements

Datasets

Datasets are collections of test questions with reference answers.

Cover the questions your agent is likely to receive, including edge cases, and give each one a clear, accurate reference answer. Use at least 10 question-answer pairs; 20-50 pairs work better.

Example dataset structure

Question: "How do I reset my password?"
Reference Answer: "Click 'Forgot Password' on the login page, enter your email, and follow the reset link sent to your inbox."
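If you draft datasets outside the UI before entering them, the structure above can be modeled and sanity-checked in a few lines. This is a minimal sketch, not the evaluation service's own format or API: the `QAPair` class, the `validate_dataset` function, and the duplicate check are hypothetical names and rules chosen for illustration. Only the 10-pair minimum comes from this page.

```python
from dataclasses import dataclass

# Minimum dataset size suggested on this page.
MIN_PAIRS = 10


@dataclass(frozen=True)
class QAPair:
    """One test question with its reference answer (hypothetical structure)."""
    question: str
    reference_answer: str


def validate_dataset(pairs: list[QAPair]) -> list[str]:
    """Return human-readable problems; an empty list means no problems were found."""
    problems = []
    if len(pairs) < MIN_PAIRS:
        problems.append(
            f"only {len(pairs)} pair(s); at least {MIN_PAIRS} recommended"
        )
    seen = set()
    for i, pair in enumerate(pairs):
        if not pair.question.strip():
            problems.append(f"pair {i}: empty question")
        if not pair.reference_answer.strip():
            problems.append(f"pair {i}: empty reference answer")
        # Compare questions case-insensitively to catch accidental duplicates.
        key = pair.question.strip().lower()
        if key and key in seen:
            problems.append(f"pair {i}: duplicate question")
        seen.add(key)
    return problems


# Usage: a one-pair dataset is flagged as too small.
draft = [
    QAPair(
        question="How do I reset my password?",
        reference_answer=(
            "Click 'Forgot Password' on the login page, enter your email, "
            "and follow the reset link sent to your inbox."
        ),
    )
]
for problem in validate_dataset(draft):
    print(problem)
```

Running a check like this before you enter the pairs catches empty answers and duplicate questions early, when they are cheapest to fix.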

To create a dataset, navigate to the evaluation service (under Services > Evaluations), provide a name and description, add questions and expected answers, then save.

Dataset overview showing all evaluation datasets with creation dates

Creating a dataset: adding test questions with expected answers

TIP

Start with 20-30 questions covering simple and complex scenarios. Update datasets as your agent evolves. Organize by topic or use case.

Running experiments

Experiments test your agent against a dataset and produce quality ratings.

To run an experiment, select an agent, pick a test dataset, start the experiment, then review the ratings and analysis.

Creating an experiment by selecting an agent and dataset

Experiment overview listing past experiments. Click an experiment for details.

Running an experiment: experiment progress during execution

Run experiments before deploying a new agent, after making significant changes to its configuration or knowledge base, and regularly (weekly or monthly) for quality monitoring.

How AI judges work

Each question is sent to your agent. Three AI judges (LLMs) then evaluate the response against the reference answer. Their ratings are averaged and displayed as star ratings.
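The averaging step can be illustrated with a short sketch. It assumes each judge returns one 0-5 rating per metric for a response, using the three metrics listed under Evaluation metrics. The function name and the dictionary layout are hypothetical; how the service stores or combines ratings internally is not documented here.

```python
from statistics import mean

# Metric names as listed under "Evaluation metrics".
METRICS = ("correctness", "completeness", "conciseness")


def average_judge_scores(judge_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric across all judges for a single agent response."""
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    return {
        metric: mean(scores[metric] for scores in judge_scores)
        for metric in METRICS
    }


# Usage: three judges rate the same response (illustrative values only).
scores = [
    {"correctness": 5, "completeness": 4, "conciseness": 3},
    {"correctness": 4, "completeness": 4, "conciseness": 3},
    {"correctness": 3, "completeness": 4, "conciseness": 3},
]
print(average_judge_scores(scores))
```

Averaging several judges reduces the effect of a single judge's outlier rating on the displayed stars.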

Evaluation metrics

Each response is rated from 0 to 5 stars on three metrics:

Metric	Description	Rating guide
Correctness	Factual accuracy compared to the reference answer. Free of misinformation, hallucinations, or contradictions.	5: Matches reference; 3: Some errors; 0: Wrong/misleading
Completeness	Addresses all parts of the query, including multi-part questions and implicit needs.	5: All parts answered; 3: Some aspects missed; 0: Incomplete
Conciseness	Efficient and direct. Avoids irrelevant tangents, redundancy, or excessive filler.	5: To the point; 3: Verbose; 0: Excessive

Rating ranges:

4-5 stars: Ready for production.
3 to below 4 stars: Works well but may have minor issues. Review failing test cases.
Below 3 stars: Review closely before deployment.
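If you record ratings in your own scripts or reports, the ranges above can be turned into a small lookup. This is a sketch under stated assumptions: `rating_guidance` is a hypothetical helper, not part of the product, and it assumes a rating of exactly 4 belongs to the top range and a rating of exactly 3 to the middle range.

```python
def rating_guidance(stars: float) -> str:
    """Map an averaged star rating (0-5) to the guidance for its range."""
    if not 0 <= stars <= 5:
        raise ValueError("star rating must be between 0 and 5")
    if stars >= 4:  # assumption: exactly 4 counts as the top range
        return "Ready for production."
    if stars >= 3:  # assumption: exactly 3 counts as the middle range
        return "Works well but may have minor issues. Review failing test cases."
    return "Review closely before deployment."


# Usage: print the guidance for a few illustrative ratings.
for value in (4.6, 3.2, 2.1):
    print(value, rating_guidance(value))
```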

Viewing results

Experiment results showing overall metrics and detailed breakdown per question

The results page shows star ratings for the three metrics at the top. Below them, a table lists each test question with its reference answer, the agent's response, the ratings, and the response latency.

Expand a question to see its full text. Low correctness ratings usually point to knowledge base gaps or retrieval issues. Low completeness ratings suggest the agent misses parts of multi-part questions. Low conciseness ratings mean the responses are overly verbose.
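The diagnostic hints above can be written down as a simple triage rule for one question's ratings. This is an illustrative sketch only: the `triage` function is hypothetical, and the threshold of 3 stars is an assumption borrowed from the "Below 3 stars" range, not a documented cutoff.

```python
# Likely causes per metric, taken from the guidance on this page.
LIKELY_CAUSES = {
    "correctness": "knowledge base gaps or retrieval issues",
    "completeness": "agent misses parts of multi-part questions",
    "conciseness": "overly verbose responses",
}


def triage(ratings: dict[str, float], threshold: float = 3.0) -> list[str]:
    """List the likely cause for every metric rated below the threshold."""
    return [
        f"{metric}: {cause}"
        for metric, cause in LIKELY_CAUSES.items()
        if ratings[metric] < threshold
    ]


# Usage: ratings for a single test question (illustrative values only).
question_ratings = {"correctness": 2.0, "completeness": 4.5, "conciseness": 1.0}
for hint in triage(question_ratings):
    print(hint)
```

Treat the output as a starting point for investigation, since a low rating can have causes other than the ones listed.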

Update your agent's knowledge base, system prompts, or retrieval settings based on results. Run the experiment again to verify improvements.

TIP

For deeper investigation, use Langfuse to inspect conversation traces, cost attribution, and raw telemetry data. In development, Langfuse is available at http://localhost:6006.

What's not implemented

The following features are not currently implemented:

Bias monitoring and model drift detection: No automated bias detection, fairness metrics, or drift detection. The evaluation framework and OpenTelemetry tracing provide foundational capabilities that could be extended.
Production A/B testing: No integrated traffic splitting or parallel testing of agent variants in production. Pre-deployment comparison via experiments is supported.
