Mastering the Unpredictable: How to Evaluate Your GenAI Chatbots

Rabobank

Evaluating chatbots, and generative AI in general, is a major challenge. Ideally, a chatbot's answers could be compared against ‘good’ or correct reference answers to determine their quality, but such ground truths are time-consuming to write and, even then, difficult to compare against. At the same time, a proper evaluation framework is key to building genAI products that can be trusted.
At Rabobank, we are working on an LLM-based evaluation approach to create safe and compliant chatbots. Our Responsible GenAI Toolkit provides metrics that measure concepts such as hallucination, correctness and completeness. In this presentation, we give an overview of the evaluation approach we take:
• What are the considerations and challenges for evaluating a chatbot?
• How did we decide on our approach?
• How does our approach compare to other common evaluation packages?
• How do we determine whether our metrics are actually measuring what they should?
We believe this will be interesting for anyone working with genAI, and especially for those looking to bring chatbots to production. As data scientists, we like to work with numerical evidence and concrete proof. So how can we impose that rigor on a field that sometimes feels unpredictable?
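
The abstract does not describe the toolkit's implementation, but as an illustration of what an LLM-based evaluation metric can look like in practice, here is a minimal Python sketch of an LLM-as-judge groundedness check. The judge_groundedness function, the prompt wording, the 1–5 scale, and the use of the OpenAI client with a "gpt-4o" judge model are assumptions for illustration only, not the Responsible GenAI Toolkit's actual implementation.

# Minimal sketch of an LLM-as-judge "groundedness" metric: a judge LLM scores
# how well a chatbot answer is supported by the retrieved context.
# NOTE: function name, prompt, 1-5 scale, and judge model are illustrative
# assumptions, not the Responsible GenAI Toolkit's actual implementation.
import re

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are evaluating a chatbot answer.
Context:
{context}

Answer:
{answer}

On a scale of 1-5, how well is the answer supported by the context alone
(5 = fully supported, 1 = mostly unsupported or hallucinated)?
Reply with a single integer."""


def judge_groundedness(answer: str, context: str, model: str = "gpt-4o") -> int:
    """Ask a judge LLM to rate how grounded `answer` is in `context`."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the scoring as deterministic as possible
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
    )
    raw = response.choices[0].message.content or ""
    match = re.search(r"[1-5]", raw)  # tolerate extra text around the score
    if match is None:
        raise ValueError(f"Judge returned no score: {raw!r}")
    return int(match.group())


if __name__ == "__main__":
    score = judge_groundedness(
        answer="Your mortgage interest rate is fixed for 10 years.",
        context="The customer's mortgage has a 10-year fixed interest period.",
    )
    print("groundedness:", score)

Note that such a judge-based metric raises exactly the question in the last bullet above: the metric itself needs validation, for example by comparing its scores against human judgments on a labelled sample.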

Presentation block 5