Early adopters of generative AI solutions have focused on transforming how businesses interact with their customers. While there have been a number of successes, the prevailing stories are of companies deploying LLM-powered tools prematurely, with significant negative consequences for themselves and their customers. These examples highlight, again and again, the risks of shipping generative AI products without comprehensive, real-world, domain-specific testing.
Good evaluations must replicate real product usage as closely as possible. Benchmarks can be a good starting point for evaluating LLMs, but such tests fail to capture the real-world performance of a generative AI product.
Realistic, domain-specific evaluation is the single most impactful step AI developers can take to ensure their products are suited for real-world applications and reduce deployment risks.
Quotient’s tools enable AI developers to rapidly build evaluation frameworks that account for their particular tasks, domains, and even organizational knowledge, and to make decisions that actually correlate with ultimate product performance.
In this blog, we compare benchmark evaluation to domain-specific evaluation using Quotient’s platform, starting with one of the most ubiquitous enterprise generative AI use cases: customer support agent augmentation.
Here’s what we found:
1️⃣ Evaluating on benchmarks selects the wrong models for domain-specific tasks.
2️⃣ Open source LLMs can outperform proprietary models on domain-specific benchmarks.
3️⃣ Benchmarks can overestimate the risk of hallucinations by 15x.
And here’s how we got to these results:
We generated a domain-specific evaluation dataset for customer support
We opted for a reference-based evaluation setup, in which the quality metrics of the LLM system are calculated by comparing its outputs against references from a curated dataset.
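As a concrete illustration of what a reference-based check can look like for summarization (this is a minimal sketch, not Quotient’s internal metrics), each model summary can be compared to its reference summary with an overlap metric such as ROUGE-L. The record field names below are assumptions for the example.

```python
# Minimal sketch of reference-based evaluation for a summarization task.
# Assumes each record pairs a model output with a human-written reference summary;
# the field names and the choice of ROUGE-L are illustrative only.
from statistics import mean

from rouge_score import rouge_scorer  # pip install rouge-score

records = [
    {
        "reference": "Customer reported a billing error; agent issued a refund.",
        "model_output": "The agent refunded the customer after a billing mistake.",
    },
    # ... one entry per conversation in the evaluation dataset
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

scores = [
    scorer.score(r["reference"], r["model_output"])["rougeL"].fmeasure
    for r in records
]
print(f"Mean ROUGE-L F1 across {len(records)} examples: {mean(scores):.3f}")
```

In practice, reference-based setups combine several metrics (lexical overlap, semantic similarity, faithfulness checks), but they all share this shape: score the model output against a trusted reference, then aggregate across the dataset.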
To ensure representative and realistic evaluations, we used Quotient to generate synthetic datasets for customer support use-cases, starting from a seed dataset of existing customer support logs.
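One way to picture the seeding step, purely as a sketch (the actual generation pipeline is Quotient’s; the prompt, the `call_llm` helper, and the output format below are hypothetical), is to show an LLM a few real support logs and ask it to produce new, stylistically consistent conversations along with reference summaries.

```python
# Hypothetical sketch of seeding synthetic data from real customer support logs.
# `call_llm` stands in for any chat-completion API; it is not part of Quotient's SDK.
import json
import random


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., a hosted or open-source model)."""
    raise NotImplementedError


def generate_synthetic_example(seed_logs: list[str]) -> dict:
    # Sample a few real conversations to anchor style, vocabulary, and issue types.
    examples = "\n\n---\n\n".join(random.sample(seed_logs, k=3))
    prompt = (
        "Below are real customer support conversations:\n\n"
        f"{examples}\n\n"
        "Write one new, realistic conversation in the same style about a different issue, "
        "followed by a concise reference summary. "
        'Return JSON with keys "conversation" and "reference_summary".'
    )
    return json.loads(call_llm(prompt))
```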
For this experiment, we chose a customer support summarization dataset containing 95 realistic chat conversations between customers and support agents, covering a diverse range of customer issues, product types, and chat sentiments.
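For orientation, an individual record in such a dataset might look like the following; the exact field names and metadata categories are assumptions, not the dataset’s actual schema.

```python
# Illustrative record shape for the summarization evaluation set (field names are assumed).
example_record = {
    "conversation": [
        {"speaker": "customer", "text": "My order #1234 arrived damaged."},
        {"speaker": "agent", "text": "I'm sorry to hear that. I can send a replacement today."},
    ],
    "reference_summary": "Customer received a damaged order; agent arranged a replacement.",
    "metadata": {
        "product_type": "electronics",     # assumed metadata fields capturing
        "sentiment": "negative",           # the diversity described above
        "issue_category": "shipping_damage",
    },
}
```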