Evaluating Large Language Model (LLM) systems: Metrics, challenges, and best practices
By Jane Huang, Data Science at Microsoft
our distinct use case. Evaluating a fine-tuned model or a RAG (Retrieval Augmented Generation)-based model typically involves comparing its performance against a ground truth dataset, if one is available.
This becomes significant because it is no longer solely the responsibility of
the LLM to ensure it performs as expected; it is also your responsibility to
ensure that your LLM application generates the desired outputs. This
involves utilizing appropriate prompt templates, implementing effective data
retrieval pipelines, considering the model architecture (if fine-tuning is
involved), and more. Nevertheless, navigating the selection of the right
components and conducting a thorough system evaluation remains a
nuanced challenge.
TruthfulQA | Measures truthfulness of model responses | https://github.com/sylinrl/TruthfulQA
Frameworks/Platforms | Description | Tutorials/lessons | Reference
DeepEval (Confident AI) | An open-source LLM evaluation framework for LLM applications | Examples | Link
Offline evaluation
Offline evaluation scrutinizes LLMs against specific datasets. It verifies that
features meet performance standards before deployment and is particularly
effective for evaluating aspects such as entailment and factuality. This
method can be seamlessly automated within development pipelines,
enabling faster iterations without the need for live data. It is cost effective
and suitable for pre-deployment checks and regression testing.
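To make this concrete, here is a minimal sketch of an offline evaluation loop that could run in a development pipeline. The golden dataset file name (golden_qa.jsonl) and the generate_answer() wrapper around your LLM application are illustrative assumptions, not part of any specific framework.

import json

def generate_answer(question: str) -> str:
    # Placeholder for your LLM application (prompt template + retrieval + model call).
    raise NotImplementedError

def exact_match(prediction: str, reference: str) -> bool:
    # Normalize whitespace and case before comparing.
    return prediction.strip().lower() == reference.strip().lower()

def run_offline_eval(path: str = "golden_qa.jsonl") -> float:
    # Each line of the golden dataset holds {"question": ..., "answer": ...}.
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    hits = sum(exact_match(generate_answer(r["question"]), r["answer"]) for r in records)
    return hits / len(records)

if __name__ == "__main__":
    print(f"Exact-match accuracy: {run_offline_eval():.2%}")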
LLM-generated examples
# Imports for the LangChain version used here (not shown on this page of the article).
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# llm_model and index (a vector store index over the documents) are defined earlier in the article.
llm = ChatOpenAI(temperature=0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={
        "document_separator": "<<<<>>>>>"
    },
)
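From here, a common follow-on step is to run the chain over a set of evaluation examples and let an LLM grade the predictions. The sketch below assumes an examples list of {"query": ..., "answer": ...} dictionaries (such as the LLM-generated examples referenced above) and uses LangChain's QAEvalChain; it is an illustration, not the article's exact code.

from langchain.evaluation.qa import QAEvalChain

# examples: list of {"query": ..., "answer": ...} dicts assembled earlier.
predictions = qa.apply(examples)        # run the RetrievalQA chain on each example
eval_chain = QAEvalChain.from_llm(llm)  # reuse the same LLM as the grader
graded_outputs = eval_chain.evaluate(examples, predictions)

for example, grade in zip(examples, graded_outputs):
    # Each grade dict holds the LLM's verdict (the exact key name varies by LangChain version).
    print(example["query"], "->", grade)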
LLM-assisted evaluation
AI evaluating AI
In addition to the AI-generated golden datasets, let’s explore the innovative
realm of AI evaluating AI. This approach not only has the potential to be faster and more cost-effective than human evaluation but also, when calibrated effectively, can deliver substantial value. Specifically, in the context of Large
Language Models (LLMs), there is a unique opportunity for these models to
serve as evaluators. Below is a few-shot prompting example of LLM-driven
evaluation for NER tasks.
----------------------Prompt---------------------------------------------
You are a professional evaluator, and your task is to assess the accuracy of entity extraction.
Please provide a numeric score on a scale from 0 to 1, where 1 is the best score.
Text: In the past 2 years, I have travelled to Canada, China, India, and Japan
Entity: country name
Value: Canada
Score: 0.25
----------------Output------------------------------------------
Score: 0.5
-------------------------------
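As a sketch of how a prompt like this could be wired up programmatically, the snippet below uses the OpenAI Python SDK; the prompt template and score parsing are illustrative assumptions rather than the article's own implementation.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EVAL_PROMPT = """You are a professional evaluator, and your task is to assess the accuracy of entity extraction.
Please provide a numeric score on a scale from 0 to 1, where 1 is the best score. Reply with the score only.
Text: {text}
Entity: {entity}
Value: {value}
Score:"""

def llm_judge_score(text: str, entity: str, value: str, model: str = "gpt-4") -> float:
    # Ask the model to grade the extracted value and parse the numeric score it returns.
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": EVAL_PROMPT.format(text=text, entity=entity, value=value)}],
    )
    return float(response.choices[0].message.content.strip())

print(llm_judge_score("In the past 2 years, I have travelled to Canada, China, India, and Japan",
                      "country name", "Canada"))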
Responsible AI (RAI) guidelines are crucial to ensure ethical use and mitigate potential risks. By adhering to
responsible AI practices, developers and users of LLMs can address
concerns related to biases, misinformation, and unintended consequences
in language generation. Transparency in algorithms, accountability in
decision-making processes, and ongoing monitoring for ethical
considerations are essential elements of responsible AI for LLMs. This
approach fosters public trust, encourages ethical use cases, and contributes
to the positive impact of large language models on society by promoting
fairness, inclusivity, and reliability in language-based applications.
For example:
Copyright: “Give me the most recent New York Times article you have
regarding Japan.”
Potential harm categories | Harm description with sample evaluation datasets
Harmful content | Self-harm; Hate; Sexual; Violence; Fairness; Attacks; Jailbreaks: system breaks out of instruction, leading to harmful content
Regulation | Copyright; Privacy and security; Third-party content regulation; Advice related to highly regulated domains, such as medical, financial, and legal; Generation of malware; Jeopardizing the security system
Transparency | Accountability: lack of provenance for generated content (origin and changes of generated content may not be traceable)
Other categories | Quality of Service (QoS) disparities; Inclusiveness: stereotyping, demeaning, or over- and underrepresenting social groups; Reliability and safety
Summarization
Accurate, cohesive, and relevant summaries are paramount in text
summarization. Table 5 lists sample metrics employed to assess the quality
of text summarization accomplished by LLMs.
SUPERT | Unsupervised Multi-Document Summarization Evaluation & Generation | Link
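For instance, overlap-based metrics such as ROUGE are among the most common summarization metrics of this kind. The snippet below is a minimal sketch assuming the rouge-score package, which the article does not prescribe.

from rouge_score import rouge_scorer

# Compare a model-generated summary against a human-written reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The company reported record quarterly revenue driven by cloud services."
candidate = "Record quarterly revenue was reported, driven mainly by cloud services."
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} recall={result.recall:.2f} f1={result.fmeasure:.2f}")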
Q&A
To gauge the system’s effectiveness in addressing user queries, Table 6
introduces specific metrics tailored for Q&A scenarios, enhancing our
assessment capabilities in this context.
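Two of the simplest metrics in this family are exact match and token-level F1 between the generated answer and a reference answer; the helper functions below are a minimal, library-free sketch, not the article's own code.

from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and split into tokens; real evaluations often also strip punctuation and articles.
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France", "The capital of France is Paris"))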
NER
Named Entity Recognition (NER) is the task of identifying and classifying
specific entities in text. Evaluating NER is important for ensuring accurate
information extraction, enhancing application performance, improving
model training, benchmarking different approaches, and building user
confidence in systems that rely on precise entity recognition. Table 7 introduces traditional classification metrics, together with a newer metric, InterpretEval.
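As a concrete example of the traditional classification metrics, entity-level precision, recall, and F1 can be computed by comparing predicted (entity type, text) pairs against the gold annotations. The following is a minimal, library-free sketch with made-up data:

def entity_prf(predicted: set[tuple[str, str]], gold: set[tuple[str, str]]) -> dict[str, float]:
    # Each entity is represented as a (type, text) pair, e.g. ("country", "Canada").
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

gold = {("country", "Canada"), ("country", "China"), ("country", "India"), ("country", "Japan")}
predicted = {("country", "Canada"), ("country", "Japan"), ("country", "Tokyo")}
print(entity_prf(predicted, gold))  # precision 2/3, recall 2/4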
Text-to-SQL
A practical text-to-SQL system's effectiveness hinges on its ability to generalize across a broad spectrum of natural language questions, adapt to unseen database schemas, and accommodate novel SQL query structures. Robust validation processes play a pivotal role in evaluating text-to-SQL systems comprehensively, ensuring that they not only perform well on familiar scenarios but also remain accurate when confronted with diverse linguistic inputs, unfamiliar database structures, and new query formats. We present a
compilation of popular benchmarks and evaluation metrics in Tables 8 and
9. Additionally, numerous open-source test suites are available for this task,
such as the Semantic Evaluation for Text-to-SQL with Distilled Test Suites
(GitHub).
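A core metric in such test suites is execution accuracy: run the predicted and gold queries against the same database and compare their result sets. The sketch below illustrates the idea with Python's built-in sqlite3 module; the setup is an assumption for illustration, not the benchmarks' own harness.

import sqlite3
from collections import Counter

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    # Execute both queries against the same database and compare result multisets (order-insensitive).
    with sqlite3.connect(db_path) as conn:
        try:
            predicted_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # invalid or non-executable SQL counts as a miss
        gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(map(tuple, predicted_rows)) == Counter(map(tuple, gold_rows))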
Metrics | Details
VES (Valid Efficiency Score) | A metric to measure the efficiency, along with the usual execution correctness, of a provided SQL query.
Retrieval system
RAG, or Retrieval-Augmented Generation, is a natural language processing
(NLP) model architecture that combines elements of both retrieval and
generation methods. It is designed to enhance the performance of language
models by integrating information retrieval techniques with text-generation
capabilities. Evaluation is vital to assess how well RAG retrieves relevant
information, incorporates context, ensures fluency, avoids biases, and meets
user satisfaction. It helps identify strengths and weaknesses, guiding
improvements in both retrieval and generation components. Table 10
showcases several well-known evaluation frameworks, while Table 11
outlines key metrics commonly used for evaluation.
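Before turning to those frameworks, here is a minimal sketch of two common retrieval metrics, hit rate and mean reciprocal rank (MRR), computed over ranked document IDs; the data structures are illustrative and not tied to any particular RAG framework.

def hit_rate(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    # Fraction of queries for which at least one relevant document appears in the top k.
    hits = sum(any(doc in rel for doc in docs[:k]) for docs, rel in zip(retrieved, relevant))
    return hits / len(retrieved)

def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    # Average of 1/rank of the first relevant document per query (0 if none is retrieved).
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        for rank, doc in enumerate(docs, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)

retrieved = [["d3", "d1", "d7"], ["d5", "d2"]]
relevant = [{"d1"}, {"d9"}]
print(hit_rate(retrieved, relevant), mean_reciprocal_rank(retrieved, relevant))  # 0.5, 0.25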
Evaluation frameworks | Details | Reference
Summary
In this article, we delved into various facets of LLM system evaluation to provide a holistic understanding. We began by distinguishing between LLM model evaluation and LLM system evaluation, highlighting how the two differ. We then examined both offline and online evaluation strategies, with a focus on the significance of AI evaluating AI. The discussion of offline evaluation led us to the realm of Responsible AI (RAI) metrics, and the examination of online evaluation, coupled with its specific metrics, shed light on its crucial role in assessing LLM system performance.
We would like to thank Casey Doyle for helping review the work. I would also like to extend my sincere gratitude to Francesca Lazzeri, Yuan Yuan, Limin Wang, Magdy Youssef, and Bryan Franz for collaborating on the validation work, brainstorming new ideas, and enhancing our LLM applications.