Everyone wants to build LLM apps, but how do you know whether the “improvements” you are making actually make the product better? You could go the route of trial and error, but quantifying improvement and consistency quickly becomes a major pain point. If you come from a software background, you may be thinking of something like unit tests. If you’re an academic, benchmarks may be your standard.
While appealing, these methods run into significant issues: the complexity, dynamism, nondeterminism, and diversity of LLM tasks and domains make them hard to apply. In this blog post, we propose a new paradigm for evaluating LLMs: metrics-driven development (MDD). MDD is analogous to test-driven development (TDD), a popular paradigm in software engineering, but differs from it in some key ways. We will explain how MDD works and why it is better suited to LLMs than TDD. In a future post, we will discuss how to actually implement a robust evaluation framework.
It’s immediately appealing to apply a well-known unit-testing framework to LLMs. The reasoning generally goes: we (the developers) know the cases we want to test, and once we write comprehensive tests, we can be confident in the output. In TDD, the tests are canonically written before the code, forcing the expected behavior to be thought through up front. People also tend to write better tests at the beginning than as an afterthought at the end. TDD has many benefits, providing clarity, direction, and quality assurance for software development.
When we try to use TDD for LLMs, it doesn’t work. The LLM may perform well on the test cases yet do poorly on unseen cases, or the tests may simply not capture the capabilities you want to measure. Why?
There are a few key reasons:
- Nondeterminism: Given identical input, an LLM may produce different output each time, making exact-match assertions unreliable.
- Fuzziness of correct behavior: There are generally degrees of goodness that are not captured when testing for one particular solution. For instance, 1+1 has a single definite answer, but whether to recommend Yosemite or Kings Canyon as a beautiful park is a far more subjective call.
- Complexity: LLMs are often tasked with open-ended, complex jobs, which makes explicit enumeration of expected behavior intractable. LLMs are also far more complex and opaque than traditional functions, so they necessarily require more evaluation points.
These differences mean that LLMs require a wider definition of “good” or “correct” and far broader sets of tests than traditional software. Therefore, we introduce Metrics Driven Development (MDD).
Introducing Metrics Driven Development
What are metrics?
Metrics are quantitative measures of an aspect of a system. With LLMs, we generally need to define them so that they point toward the behavior we desire. For example, common metrics for LLMs include accuracy, fluency, coherence, relevance, and diversity. Designing metrics is an introspective, time-consuming process, but a critical one: it defines the desired outcome of your LLM app and helps you reach it.
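To make this concrete, one way to think of a metric is as a function that maps an (input, output) pair to a score. The sketch below is a minimal, hypothetical example; the keyword-overlap heuristic is only a stand-in for whatever scoring logic your metric actually needs.

```python
from typing import Callable

# A metric is just a function from an (input, output) pair to a score.
Metric = Callable[[str, str], float]  # (user_input, llm_output) -> score in [0, 1]

def keyword_relevance(user_input: str, llm_output: str) -> float:
    """Toy relevance metric: what fraction of the input's keywords the output mentions."""
    keywords = set(user_input.lower().split())
    if not keywords:
        return 0.0
    mentioned = sum(1 for word in keywords if word in llm_output.lower())
    return mentioned / len(keywords)
```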
What is MDD?
MDD is similar to TDD in the sense that it also scopes the expected behavior and functionality of the product, sets goals, and measures progress. Instead of tests, however, it uses metrics.
LLM metrics differ from traditional tests in that they cover more points and have a notion of similarity to reference answers. For instance, an addition function could have unit tests verifying that 1+1=2 and -4+4=0. If the system passes these tests, it likely works, because its behavior is tightly scoped and the function’s complexity is low.
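For a function this simple, ordinary unit tests really do give you confidence. Here is what they might look like in pytest style, with `add` as our hypothetical system under test:

```python
def add(a: int, b: int) -> int:
    return a + b

# Exact-match assertions work because the expected behavior is fully enumerable.
def test_add_positive():
    assert add(1, 1) == 2

def test_add_negative_cancels():
    assert add(-4, 4) == 0
```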
Now suppose there is an LLM that recommends national parks given a user’s interests. Evaluating it with a test case doesn’t work: suggesting Yosemite when the unit test was looking for Kings Canyon counts as a “failed” test, even though Yosemite is a perfectly good recommendation.
Consider instead a metric of “usefulness” for this national park recommendation system. Instead of grading each response as “true” or “false,” the metric scores the response based on how reasonable the recommendation is for the specific user persona. The system may then rate both Kings Canyon and Yosemite with high usefulness scores while grading the Everglades lower, since it has a completely different topography.
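Here is a rough sketch of what such a metric could look like. The rubric wording, the 1-5 scale, and the `judge` callable (standing in for whatever LLM client you use) are all illustrative assumptions, not a prescribed API:

```python
from typing import Callable

USEFULNESS_RUBRIC = """\
You are grading a national park recommendation.
User interests: {interests}
Recommended park: {recommendation}
On a scale of 1 (poor fit) to 5 (excellent fit), how reasonable is this
recommendation for these interests? Respond with only the number."""

def usefulness(interests: str, recommendation: str, judge: Callable[[str], str]) -> int:
    """Score a recommendation with a judge LLM instead of an exact-match assertion."""
    prompt = USEFULNESS_RUBRIC.format(interests=interests, recommendation=recommendation)
    return int(judge(prompt).strip())
```

With this framing, Yosemite and Kings Canyon can both earn high scores for a user who loves granite peaks and waterfalls, while the Everglades scores low, matching the intuition above.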
The crux of the MDD flow (sketched in code after the list below) is:
- Come up with a metric idea
- Source some examples
- Create the grading rubric (grading prompt)
- Use LLMs to grade the LLM service
- Refine the metric through iteration
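Putting the pieces together, a bare-bones version of this loop might look like the following. The example set, rubric, and `judge` callable are hypothetical placeholders; the point is simply that you grade many examples, aggregate the scores, and watch that aggregate as you iterate:

```python
from statistics import mean
from typing import Callable

# A handful of sourced examples (step 2); in practice you want many more.
EXAMPLES = [
    {"interests": "granite cliffs and waterfalls", "recommendation": "Yosemite"},
    {"interests": "granite cliffs and waterfalls", "recommendation": "Everglades"},
]

# The grading rubric / grading prompt (step 3).
RUBRIC = """User interests: {interests}
Recommended park: {recommendation}
On a scale of 1 to 5, how reasonable is this recommendation? Reply with only the number."""

def run_eval(examples: list[dict], judge: Callable[[str], str]) -> float:
    """Grade every example with the judge LLM (step 4) and report the mean score."""
    scores = [int(judge(RUBRIC.format(**example)).strip()) for example in examples]
    return mean(scores)

# Track this number across iterations of your prompts, your model, and the rubric itself (step 5).
```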
MDD addresses each of the issues faced in traditional LLM evaluation and gives a much truer picture of progress. Nondeterminism and fuzziness are handled by moving from exact match to fuzzy match, implemented by LLMs evaluating LLMs. Complexity is handled by evaluation sets that are far more statistical and comprehensive than unit tests. Your metric should never saturate at 100%, whereas traditionally you expect 100% of your unit tests to pass.
The metrics can be either reference-labeled or not. Each approach has tradeoffs, which we will discuss in the next blog post. Stay tuned!