When evaluations can propel us forward
To consistently ship reliable AI products, today's AI engineers must reclaim and adapt trusted software development paradigms like test-driven development. Some experts have called this shift evaluation-driven development.
There are plenty of reasons developers avoid investing time in traditional software testing, be it manual, unit, or integration testing: high perceived complexity, the time investment required, the expected cost of maintaining test suites, or a lack of experience writing good test suites. While AI developers cite these same issues, the unpredictable behavior and inherent randomness of LLMs make a good evaluation framework a “do it now or regret it later” concern.
A good evaluation set for LLMs meets these five criteria:
- Realistic: Evaluations should accurately reflect production conditions, testing the product on the scenarios users will actually encounter.
- Aligned: Evaluations must correlate with human judgment. For example, a customer support AI product should be graded on the same scale as a human agent's responses.
- Comprehensive: Evaluations should test a wide range of production scenarios, because model responses are non-deterministic and user interactions add complexity. Accounting for this variety is nothing like, say, shipping a UI change that every user experiences the same way. Evals should cover both the expected 'happy paths' and the critical 'sad paths': the edge cases and failure scenarios that could lead to significant issues. Ultimately, the space of possible inputs and outputs from an LLM is infinite, so we need to think deliberately about scenario coverage. Let's call this “eval coverage” (see the sketch after this list).
- Reproducible: Evaluations should produce consistent results under unchanged conditions, ensuring reliable measurement. Unless something about the system has changed (e.g. the model, the prompts, the RAG pipeline), the eval results should be broadly the same: rerunning the same evals should not produce wildly different scores or change the conclusions we draw about the system.
- Secret: Models should not be trained on evaluation datasets, to prevent gaming the evaluations. Put the other way around, models should not be tested on data that is likely part of their training set. Passing evaluations should indicate genuine usability, not just trained compliance.
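To make these criteria a little more concrete, here is a minimal sketch in Python of how an eval case and a small eval run could be represented. The names (`EvalCase`, `run_evals`, `EVAL_SET`) and the example cases are hypothetical illustrations, not a prescribed schema: the point is that each case records whether it is a happy or sad path, the set is held out from any training data, and repeated runs can be compared to check reproducibility.

```python
from dataclasses import dataclass
from typing import Callable, Literal
from statistics import mean

@dataclass
class EvalCase:
    # A single scenario drawn from (or modeled on) real production traffic.
    scenario: Literal["happy_path", "sad_path"]
    user_input: str
    expected_behavior: str  # what a human reviewer would accept as a good answer

# Hypothetical evaluation set: kept out of any fine-tuning data ("secret"),
# and covering both expected usage and failure-prone edge cases ("comprehensive").
EVAL_SET = [
    EvalCase("happy_path", "How do I reset my password?",
             "Clear reset steps, polite tone"),
    EvalCase("sad_path", "Cancel my account and delete everything!!!",
             "De-escalates, confirms intent, no fabricated policies"),
]

def run_evals(generate: Callable[[str], str],
              grade: Callable[[str, str], float],
              runs: int = 3) -> float:
    """Run the eval set several times and return the mean score.

    Repeating the run surfaces non-determinism: if scores swing wildly
    between runs with nothing changed, the evals are not reproducible.
    """
    scores = []
    for _ in range(runs):
        for case in EVAL_SET:
            response = generate(case.user_input)
            scores.append(grade(response, case.expected_behavior))
    return mean(scores)
```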
Build AI evaluations with real data and human feedback
We have found that building evaluations that meet these criteria requires a combination of automatic evaluations and human-in-the-loop feedback, starting from real data.
First, automatic evaluations are necessary because of the vast number of scenarios an LLM must be tested against; generating them manually is impractical. Paying humans to hand-write the thousands of datapoints required for comprehensive testing would be prohibitively expensive and slow. This is also part of the reason features stay stuck in the pre-production phase for so long.
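As one illustration of how automatic evaluation can scale past hand-written test cases, the sketch below uses an LLM to turn a single real production query into additional eval variants. It assumes the OpenAI Python SDK purely as an example client (any LLM provider would work), and the prompt, model choice, and function name are hypothetical rather than part of any particular framework.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_eval_cases(real_query: str, n_variants: int = 5) -> list[str]:
    """Use an LLM to generate paraphrases and edge-case variants of a real
    production query, so eval coverage grows without manual authoring."""
    prompt = (
        f"Here is a real user query to a customer support assistant:\n"
        f"{real_query!r}\n\n"
        f"Write {n_variants} distinct variants of this query: include rephrasings, "
        f"typos, and adversarial or frustrated versions. Return one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model choice, not a recommendation
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# Example: grow the eval set from one real query pulled from production logs.
variants = expand_eval_cases("How do I reset my password?")
```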
Additionally, human feedback is an essential component of evaluation development because it ensures evaluations reflect evolving end-user concerns, which automatic methods alone cannot fully capture. There is a great deal of subjectivity in deciding what is good and what isn't.
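One simple way to keep the automatic side aligned with that human judgment is to periodically compare an automated grader's verdicts against human reviews of the same outputs. The sketch below assumes both graders assign pass/fail labels; the function name, sample data, and threshold are illustrative only.

```python
def judge_agreement(auto_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of eval responses where the automatic judge and the human
    reviewer reached the same pass/fail verdict."""
    assert len(auto_labels) == len(human_labels)
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)

# Example: spot-check the automatic judge against a small human-labeled sample.
auto = [True, True, False, True, False]
human = [True, False, False, True, False]
if judge_agreement(auto, human) < 0.8:  # illustrative threshold
    print("Automatic judge is drifting from human judgment; recalibrate.")
```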
Finally, starting from real data ensures evaluations are grounded in the specific use cases being addressed. For instance, the evaluation requirements for a customer support feature will differ from those for a recommendation engine (think: tone, contextual understanding, etc.), even if they address similar challenges.
Our approach
Quotient AI automates what was previously a manual evaluation process, starting from real data and incorporating human feedback.