When evaluations can propel us forward
To consistently ship reliable AI products, today's AI engineers must reclaim and adapt trusted software development paradigms like test-driven development. Some experts have called this shift evaluation-driven development.
There are plenty of reasons developers avoid investing time in traditional software testing, be it manual, unit, or integration testing: high perceived complexity, the time investment required, the expected cost of maintaining test suites, or a lack of experience writing good test suites. While AI developers cite these same issues, the unpredictable behavior and inherent randomness of LLMs make a good evaluation framework a “do it now or regret it later” concern.
A good evaluation set for LLMs meets these five criteria:
- Realistic: Evaluations should accurately reflect production conditions, testing the product on the scenarios users will actually encounter.
- Aligned: Evaluations must correlate with human judgment. For example, a customer support AI product should be graded on the same scale as a human agent's responses.
- Comprehensive: Evaluations should test a wide range of production scenarios, because model responses are non-deterministic and user interactions add complexity. Accounting for this variety is nothing like, say, shipping a UI change that every user experiences the same way. Evals should cover both the expected 'happy paths' and the critical 'sad paths': the edge cases and failure scenarios that could lead to significant issues. Ultimately, the space of possible inputs and outputs from an LLM is infinite, so we need to think deliberately about scenario coverage. Let's call this “eval coverage” (see the sketch after this list).
- Reproducible: Evaluations should produce consistent results under unchanged conditions, ensuring reliable measurement. Unless something about the system has changed (e.g. the model, the prompts, the RAG pipeline), the eval results should be broadly the same: rerunning the same evals should not produce wildly different scores or change the conclusions we draw about the system.
- Secret: Models should not be trained on evaluation datasets, to prevent gaming the evaluations. Put the other way around, models should not be tested on data that is likely part of their training set. Passing evaluations should indicate genuine usability, not just trained compliance.
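To make these criteria a little more concrete, here is a minimal sketch in Python of how an eval case and a small eval run could be represented. The names (`EvalCase`, `run_evals`, `EVAL_SET`) and the example cases are hypothetical illustrations, not a prescribed schema: the point is that each case records whether it is a happy or sad path, the set is held out from any training data, and repeated runs can be compared to check reproducibility.

```python
from dataclasses import dataclass
from typing import Callable, Literal
from statistics import mean

@dataclass
class EvalCase:
    # A single scenario drawn from (or modeled on) real production traffic.
    scenario: Literal["happy_path", "sad_path"]
    user_input: str
    expected_behavior: str  # what a human reviewer would accept as a good answer

# Hypothetical evaluation set: kept out of any fine-tuning data ("secret"),
# and covering both expected usage and failure-prone edge cases ("comprehensive").
EVAL_SET = [
    EvalCase("happy_path", "How do I reset my password?",
             "Clear reset steps, polite tone"),
    EvalCase("sad_path", "Cancel my account and delete everything!!!",
             "De-escalates, confirms intent, no fabricated policies"),
]

def run_evals(generate: Callable[[str], str],
              grade: Callable[[str, str], float],
              runs: int = 3) -> float:
    """Run the eval set several times and return the mean score.

    Repeating the run surfaces non-determinism: if scores swing wildly
    between runs with nothing changed, the evals are not reproducible.
    """
    scores = []
    for _ in range(runs):
        for case in EVAL_SET:
            response = generate(case.user_input)
            scores.append(grade(response, case.expected_behavior))
    return mean(scores)
```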
Build AI evaluations with real data and human feedback
We have found that building evaluations that meet these criteria requires a combination of automatic evaluations and human-in-the-loop feedback, starting from real data.
First, automatic evaluations are necessary because of the vast number of scenarios an LLM must be tested against; generating them manually is impractical. Paying humans to hand-write the thousands of datapoints required for comprehensive testing would be prohibitively expensive and slow. This is also part of the reason features stay stuck in the pre-production phase for so long.
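As one illustration of how automatic evaluation can scale past hand-written test cases, the sketch below uses an LLM to turn a single real production query into additional eval variants. It assumes the OpenAI Python SDK purely as an example client (any LLM provider would work), and the prompt, model choice, and function name are hypothetical rather than part of any particular framework.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_eval_cases(real_query: str, n_variants: int = 5) -> list[str]:
    """Use an LLM to generate paraphrases and edge-case variants of a real
    production query, so eval coverage grows without manual authoring."""
    prompt = (
        f"Here is a real user query to a customer support assistant:\n"
        f"{real_query!r}\n\n"
        f"Write {n_variants} distinct variants of this query: include rephrasings, "
        f"typos, and adversarial or frustrated versions. Return one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model choice, not a recommendation
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# Example: grow the eval set from one real query pulled from production logs.
variants = expand_eval_cases("How do I reset my password?")
```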
Additionally, human feedback is an essential component of evaluation development because it ensures evaluations reflect evolving end-user concerns, which automatic methods alone cannot fully capture. There is a great deal of subjectivity in deciding what is good and what isn't.
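One simple way to keep the automatic side aligned with that human judgment is to periodically compare an automated grader's verdicts against human reviews of the same outputs. The sketch below assumes both graders assign pass/fail labels; the function name, sample data, and threshold are illustrative only.

```python
def judge_agreement(auto_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of eval responses where the automatic judge and the human
    reviewer reached the same pass/fail verdict."""
    assert len(auto_labels) == len(human_labels)
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)

# Example: spot-check the automatic judge against a small human-labeled sample.
auto = [True, True, False, True, False]
human = [True, False, False, True, False]
if judge_agreement(auto, human) < 0.8:  # illustrative threshold
    print("Automatic judge is drifting from human judgment; recalibrate.")
```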
Finally, starting from real data ensures evaluations are grounded in the specific use cases being addressed. For instance, the evaluation requirements for a customer support feature will differ from those for a recommendation engine (think: tone, contextual understanding, etc.), even if they address similar challenges.
Our approach
Quotient AI automates what was previously a manual evaluation process, starting from real data and incorporating human feedback.