Solving Domain-Specific Problems Using LLMs
Authors: Christopher Semturs, Shekoofeh Azizi, Scott Coull, Umesh Shankar, and Wieland Holfelder
Acknowledgements
Johan Schalkwyk
Sami Lachgar
Dale Webster
Yun Liu
Connie Fan
Chris Grier
Antonio Gulli
Anant Nawalgaria
Grace Mollison
Technical Writer
Karen Lin
Designers
Michael Lanning
Luxi Chen
September 2024
Table of contents
Introduction
Challenges in cybersecurity
The opportunities
Summary
Endnotes
Introduction
Large language models (LLMs) have emerged as powerful tools for tackling complex
challenges in numerous domains. While early iterations focused on general-purpose tasks,
recent developments have highlighted the potential of fine-tuning LLMs to address specific
problems within specialized fields. This whitepaper explores these concepts in two distinct
domains: cybersecurity and medicine. Each showcases the unique ability of LLMs to enhance
existing workflows and unlock new possibilities.
In cybersecurity, defenders contend with constantly evolving threats, heavy operational toil, and a persistent shortage of skilled talent. Security-focused models such as SecLM aim to ease this burden by synthesizing threat data and assisting with repetitive analysis tasks.
In the field of medicine, LLMs face a different set of obstacles, such as the vast and ever-evolving nature of medical knowledge and the need to apply that knowledge in a context-dependent manner, which makes accurate diagnosis and treatment a continual challenge.
LLMs like Med-PaLM, customized for medical applications, demonstrate the ability to answer
complex medical questions and provide insightful interpretations of medical data, showing
potential for supporting both clinicians and patients.
Through the lens of these two distinct domains, in this whitepaper we will explore the
challenges and opportunities presented by specialized data, technical language, and
sensitive use cases. By examining the unique paths taken by SecLM and Med-PaLM, we
provide insights into the potential of LLMs to revolutionize various areas of expertise.
Challenges in cybersecurity
In the movies, we often see information security reduced to the caricature of hoodie-clad
and headset-wearing hackers with ill intent, armed with ruggedized laptops, tapping away
furiously until we hear the two magic words: “I’m in.”
To the extent that you even see the defenders, they are in reactive mode: think war rooms, empty coffee cups, people barking orders, and monitors showing the attacker’s every move in real time.
In reality, the people who practice cybersecurity - the developers, system administrators,
SREs, and many junior analysts to whom our work here is dedicated - have the Sisyphean
task of keeping up with the latest threats and trying to protect complex systems against
them. Many practitioners’ days are largely filled with repetitive or manual tasks, such as
individually triaging hundreds of alerts, that take valuable time away from developing more
strategic defenses. The momentum is definitely not in the defender’s favor; attackers are
adopting advanced technologies, including artificial intelligence,1 to extend their reach and
quicken the pace of exploitation. And there are definitely no monitors showing the attacker’s
every move!
Based on our experience working with users and partners, we see three major challenges in
the security industry today: threats, toil, and talent.
• New and evolving threats: The threat landscape is constantly changing, with new
and increasingly sophisticated attacks emerging all the time. This makes it difficult for
defenders to keep up with the latest information, and conversely for practitioners to sift
through that flood of data to identify what’s relevant to them and take action.
• Operational toil: People working in security operations or DevOps roles often spend a
significant amount of time on repetitive manual tasks that could be automated or assisted.
This leads to overload and takes away time from more strategic activities. Excessive focus
on minutiae also prevents analysts and engineers from seeing the bigger picture that is
key to securing their organizations.
• Talent shortage: There are far more open security roles than there are skilled practitioners to fill them, making it difficult for organizations to build and retain the teams they need.
Without the ability to address these three challenges, it will be difficult to keep up with the demands of modern cybersecurity systems.
We envision a world where novices and security experts alike are paired with AI expertise
to free themselves from repetition and toil, accomplish tasks that seem impossible to us
today, and provide new opportunities to share knowledge. Large language models (LLMs)
and adjacent GenAI techniques can meaningfully improve the working lives of both security
novices and experienced practitioners. Indeed, in many cases, we have already found that
GenAI is useful to solve a number of real-world security problems in our challenge areas:
• Top layer: existing security tools that understand the relevant context and data, and
can actuate necessary changes;
Notably, one of the key benefits of LLMs is their ability to process and synthesize vast
amounts of heterogeneous data, an important capability in the increasingly siloed world of cybersecurity data. We seek to leverage that capability to solve challenging security problems through a single interface: the SecLM API.
This API offers rich planning capabilities that combine LLMs and other ML models, Retrieval-
Augmented Generation (RAG) to ground results in authoritative data, and tool use to perform
actions or look up relevant information. We argue that this holistic approach is critical
because accuracy is so important in security and LLMs alone cannot inherently solve all
security problems.
Our vision of the SecLM API is to provide a ‘one-stop shop’ for getting answers to security
questions, regardless of their level of complexity. That is, the engineer or analyst can pose
questions and refer to data sources with natural language, and expect an answer that
automatically incorporates the necessary information. However, security problems often
require a lot of information to be gathered and analyzed using domain-specific reasoning,
often by experts across several disciplines.
Ideally, one can ask the SecLM API a question in a zero-shot manner and get a high-quality
response without fussing over prompting or manually integrating external data. In order to
achieve this in a coherent and seamless manner, it is important to have a well-designed API
that interacts with LLMs and traditional ML models, the user’s data, and other services to
accurately complete the task at hand. Due to the complex nature of these security problems,
we must aim to address the following key requirements:
• Freshness: The model should be able to access the latest threat and vulnerability
data, which changes on a daily basis. Due to its cost and duration (often days),
retraining the model on a daily or hourly basis to incorporate the latest data is not a
feasible approach.
• User-specific data: The model should be able to operate on the user’s own security
data within the user’s environment without the risk of exposing that sensitive data
to others or the infrastructure provider. This rules out any centralized training on
user data.
• Multi-step reasoning: The model should be able to reason about the provided security data in a multi-step fashion by combining different data sources, techniques, and specialized models to solve security problems.
SecLM addresses these challenges through the use of security-specialized LLMs, traditional
ML models, and a flexible planning framework that enables dynamic use of tools and
interaction among multiple domain-specialized agents to reason over the provided data.
Here, we will briefly discuss our approach to training security-specialized models and
designing the planning framework that drives the SecLM API.
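To make the idea of a planning framework more concrete, here is a minimal sketch of how such a layer might dispatch between a retrieval tool and a SIEM query tool before composing a final answer. The tool registry, the step format, and the call_model helper are hypothetical illustrations under simplifying assumptions, not the actual SecLM API.

```python
# Hypothetical planner loop: the model proposes a step, the framework executes it,
# and the observation is fed back until the model emits a final answer.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    # Illustrative tools; real ones would call threat intel feeds, SIEMs, etc.
    "retrieve_threat_intel": lambda query: f"Fresh intelligence snippets for: {query}",
    "run_siem_query": lambda query: f"Events matching: {query}",
}

def call_model(transcript: str) -> str:
    """Placeholder for the security-tuned LLM.
    It returns either 'TOOL <name>: <argument>' or 'FINAL: <answer>'."""
    return "FINAL: no further action needed"

def plan_and_answer(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = call_model(transcript)
        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip()
        name, _, argument = step.removeprefix("TOOL ").partition(":")
        observation = TOOLS[name.strip()](argument.strip())  # execute the chosen tool
        transcript += f"\n{step}\nObservation: {observation}"
    return "Unable to complete the plan within the step budget."

print(plan_and_answer("What APT41 activity is visible in my environment?"))
```

In a real system, the model's proposed steps would be validated and the tool observations fed back as grounded context, which is what keeps the final answer tied to authoritative data rather than the model's parametric memory alone.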
One of the things we observed in applying LLMs to security is that general-purpose models didn't perform as well as we needed on some security tasks. The reasons for this fall into
three categories:
• Lack of publicly available security data: LLMs are data-hungry, requiring large pre-
training corpora for best results. At the same time, security data is sensitive so we cannot
use real security data in training. Moreover, what little data is available publicly is usually
concentrated on a small number of the most popular security products or on generic
security content that lacks connection to concrete application.
• Limited depth of security content: Similarly, there is a certain highly technical language
that is used to talk about security or express security insights, often crossing disciplines
from low-level computer science concepts to high-level policy and intelligence analysis.
To be effective, security LLMs must seamlessly blend this language, connect it to the underlying technical concepts, and synthesize relevant, accurate output for security
analysts and engineers to consume. While there are some high-quality, in-depth articles
that explain how to address well-known vulnerabilities or attacks, thousands of new
threats emerge each year.
• Sensitive use cases: There are some use cases in security that general-purpose models do not handle by design, such as abuse-related areas like malware or phishing. In most cases, general-purpose LLMs would actively work to avoid incorporating such tasks or related data for fear of increasing the risk of misuse or abuse. However, these cases are crucial for
security practitioners looking to secure their systems, to analyze artifacts, or even for
testing purposes.
This broad set of supported tasks means that we have to take into account multiple use
cases and environments when making design decisions, such as choosing the model size and
composition of training tasks. For example, an LLM with hundreds of billions of parameters
may maximize reasoning and abstraction capabilities, but might not be ideal for latency-
sensitive or high-volume tasks, like summarizing and categorizing security events.
To ensure the model generalizes to new tasks and security products not directly visible in the
training data, we have to be very careful with the training regime used to create the models.
As an example, consider that for many task areas, such as translating natural language into a
domain-specific query language, it is highly likely that any training data we have will contain
only a fraction of the eventual targets for our users. In this case, without careful curation
of the training data, we may inadvertently eliminate the ability of the model to generalize
to new tasks or data sources that are important to users. Likewise, some data sources are
particularly sensitive or proprietary and should not be included in generalized training of
the model. Instead, these data sources should be incorporated into a specialized derivative
model (using a lightweight, parameter-efficient process) that does not degrade the overall
performance of the core security-specialized model.
The training process, shown in Figure 1, demonstrates how we leverage each phase of
training to target specific tasks and types of data to balance performance, generalization,
and separation of proprietary data.
Figure 1. High-level training flow for core SecLM and specialized derivative models
As pre-training is the most expensive and time-consuming stage, it makes sense to start
from a robust foundational model with exposure to the broadest set of training data possible,
including billions or even trillions of tokens of general text, code, and structured data across
dozens of languages and formats. This gives us the added benefit of multilingual support,
which is an important feature for threat intelligence use cases and international users.
Building on this foundation, continued pre-training on security-specific content gives the model the domain knowledge needed to perform the broad range of tasks that SecLM models will be trained on in the supervised
fine-tuning phase. Here, proprietary data is compartmentalized within specific tasks that
mirror those performed by security experts on a day-to-day basis, including analysis of
malicious scripts, explanation of command line invocations, explanation of security events,
summarization of threat intelligence reports, and generation of queries for specialized
security event management technologies.
Given the diversity of downstream tasks that are expected of the model, evaluating its
performance can be a challenging exercise, particularly when some categories of tasks may
experience inherent trade-offs. For this reason, the fine-tuned model is evaluated using a
number of complementary methods. Several of our downstream tasks, such as malware
classification and certain types of simple security-focused question answering, can be
framed as classification problems and a standard battery of classification metrics can be
used to concretely quantify the performance on those tasks. For other, less quantifiable
tasks, we can leverage a set of golden responses to calculate similarity-based metrics (e.g., ROUGE,2 BLEU,3 BERTScore4), and we can also compare across models with automated side-by-side preference evaluations performed by a separate (oftentimes larger)
LLM. Finally, given the highly technical nature of security problems and the importance of
accuracy in our tasks, we rely on expert human evaluators to score outputs using a Likert
scale and side-by-side preference evaluation. Taken together, these metrics provide us with
the guidance needed to ensure our fine-tuning training has improved overall model quality,
and help us direct future changes in model training.
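As an illustration of how such complementary evaluations might be wired up, the sketch below scores model outputs against golden responses with ROUGE-L and computes plain accuracy for tasks framed as classification. It assumes the open-source rouge_score package and a placeholder seclm_generate() function; it is not the evaluation harness described above.

```python
# Sketch of complementary evaluations: ROUGE-L against golden responses plus
# accuracy for classification-style tasks. Requires `pip install rouge-score`.
from rouge_score import rouge_scorer

def seclm_generate(prompt: str) -> str:
    """Placeholder for a call to the fine-tuned model under evaluation."""
    return "APT41 targeted several organizations using spearphishing."

golden = [
    {"prompt": "Summarize this threat report: ...",
     "reference": "APT41 targeted several organizations using spearphishing."},
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def mean_rouge_l(examples) -> float:
    """Average ROUGE-L F1 of model outputs against golden responses."""
    scores = [
        scorer.score(ex["reference"], seclm_generate(ex["prompt"]))["rougeL"].fmeasure
        for ex in examples
    ]
    return sum(scores) / len(scores)

def accuracy(examples, labels) -> float:
    """Accuracy for tasks (e.g., malware classification) framed as classification."""
    predictions = [seclm_generate(ex) for ex in examples]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(mean_rouge_l(golden))
```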
At the conclusion of the fine-tuning stage, we have a model capable of performing many
of the same core tasks as security experts. However, because of our need to ensure
generalization across a wide range of user environments and the inherent trade-off among
some security tasks, the model may still require the use of in-context learning examples,
retrieval-augmented generation, and parameter-efficient tuning (PET) methods. For example,
if a new user wanted to leverage SecLM to query and analyze data on a new security platform
that was not present during core training, it is likely that the model may need in-context
examples to help generalize to the new system. Similarly, if a user wanted to incorporate
specialized knowledge about their network and assets or better align model behavior with
human security experts, it would be best added via PET adapters trained on their sensitive
data. Retrieval-augmented generation, meanwhile, allows us to pull in the freshest and most
recent threat information for the model to process, rather than relying on stale data ingested
during less frequent training runs.
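For readers unfamiliar with parameter-efficient tuning, the following sketch shows one common PET approach, LoRA adapters, using the open-source Hugging Face peft library. The base model name, target modules, and hyperparameters are placeholders, and this is not the process used to build SecLM derivatives; it simply illustrates how a small set of adapter weights can be trained on sensitive data while the shared core model stays frozen.

```python
# Minimal LoRA (parameter-efficient tuning) sketch with Hugging Face peft.
# Model name, target modules, and hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model_name = "your-org/security-base-llm"  # placeholder for a core security-tuned model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                      # rank of the low-rank adapter matrices
    lora_alpha=16,            # scaling factor applied to the adapter update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

# Wrap the frozen base model; only the small adapter weights are trained, so a
# user's sensitive data shapes the adapter without changing the shared core model.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```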
As you might imagine, actually building the underlying framework that orchestrates the
planning and execution of these complex tasks requires solving some difficult systems
engineering and machine learning challenges. The example, shown in Figure 2, illustrates how
SecLM's specialized models can be tied into a broader ecosystem to best leverage fresh,
user-specific data and authoritative security expertise in a natural and seamless way.
Figure 2. SecLM platform leveraging multi-step reasoning to answer a broad, high-level question about
advanced persistent threat actor activity
In Figure 2, we have a fairly broad, high-level question regarding the tactics, techniques, and
procedures (TTPs) of an advanced persistent threat (APT) group, in this example ‘APT41’. The
analyst asking this question needs to understand what those TTPs are and discover potential
indications of them in their own network. To answer this question, the SecLM API needs to
invoke a complex, multi-step planning process to break down the problem into individual
tasks: 1) Retrieve the necessary information, 2) Extract and synthesize that information, 3)
Use the information to query the relevant events from the user’s Security Information and
Event Management (SIEM) product. In the SecLM reasoning framework, this plan can be
generated statically by security experts or in real time through a combination of expert guidance and highly capable LLMs using chain-of-thought-style prompting.
First, the SecLM API planner retrieves the most recent information about “APT41” from one of
possibly many of the user’s threat intelligence subscriptions. That raw response is processed
to extract TTP information and possible indicators of compromise from the voluminous threat
intelligence data. Next, a specialized SecLM fine-tuned (using PET) for the query language
of the SIEM is used to translate those TTPs into concrete clauses in the appropriate syntax
and using the appropriate schema. Using that query, the API can then directly retrieve
the matching security events from the SIEM, and finally use SecLM to aggregate all of the
available information into a comprehensible final response for the analyst.
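The sketch below restates the Figure 2 walkthrough as code. Every function is a hypothetical stand-in for a service in the SecLM ecosystem (threat intelligence retrieval, TTP extraction, PET-adapted query translation, SIEM search, and summarization); only the overall step ordering mirrors the plan described above.

```python
# Hypothetical restatement of the Figure 2 plan: retrieve intel, extract TTPs,
# translate them into a SIEM query, run it, and summarize the results.

def get_threat_intel(actor: str) -> str:
    """Placeholder: fetch the latest report on the actor from a threat intelligence feed."""
    return f"Raw threat intelligence report for {actor} ..."

def extract_ttps(report: str) -> list[str]:
    """Placeholder: use the security-tuned LLM to pull TTPs and indicators from the report."""
    return ["T1059.001 PowerShell execution", "T1071 web-based command and control"]

def ttps_to_siem_query(ttps: list[str]) -> str:
    """Placeholder: a PET-adapted model translates TTPs into the SIEM's query syntax."""
    return 'process_name = "powershell.exe" AND network.direction = "outbound"'

def run_siem_query(query: str) -> list[dict]:
    """Placeholder: execute the query against the user's SIEM and return matching events."""
    return [{"host": "workstation-42", "event": "encoded PowerShell launched"}]

def summarize(actor: str, ttps: list[str], events: list[dict]) -> str:
    """Placeholder: the LLM aggregates everything into an analyst-readable answer."""
    return f"{actor} TTPs: {ttps}; matching events in your environment: {events}"

def answer_question(actor: str) -> str:
    report = get_threat_intel(actor)        # Step 1: retrieve the necessary information
    ttps = extract_ttps(report)             # Step 2: extract and synthesize the TTPs
    query = ttps_to_siem_query(ttps)        # Step 3: translate TTPs into a SIEM query
    events = run_siem_query(query)          #         and fetch the matching events
    return summarize(actor, ttps, events)   # Final: compose the analyst-facing answer

print(answer_question("APT41"))
```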
Overall, the SecLM API would save the analyst in the above example substantial time -
possibly hours - by automating multiple tedious steps across several different security
services and systems. Meanwhile, the analyst’s time and attention are available to consider
the results and plan for follow-up investigations or remediation steps, which may also be
assisted by the SecLM API. While this is one example of how the SecLM API automatically
plans and orchestrates operations across multiple models and retrieval sources, there are
a multitude of such use cases where tool use (e.g., code execution), retrieval-augmented
generation, specialized models, and long-term memory (e.g., storage of user preferences)
can help solve challenging security problems and answer difficult questions that save users
valuable time, even autonomously with the use of agents.
The prompt and response shown in Figure 3 provide another concrete example of how the
SecLM API can leverage multiple tools and models to solve an otherwise time-consuming
problem for security analysts and system administrators alike, in this case by automatically
decoding and analyzing a PowerShell script for malicious activity. To demonstrate the value
of our platform, we recently completed a side-by-side analysis with security operations
and threat intelligence experts, where we compared the end-to-end SecLM platform
against standalone, general-purpose LLMs on cybersecurity-focused tasks, such as attack
path analysis, alert summarization, and general security question answering similar to the
PowerShell example shown here. The results demonstrated a clear preference for SecLM,
with win rates between 53% and 79% across the security-focused tasks, and underscore the
importance of a full-featured platform in the domain of cybersecurity.
Figure 3. An example response from the SecLM platform using a base64 decoding tool and the SecLM model
to analyze an obfuscated PowerShell command used in a ‘living off the land’ attack
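To illustrate the tool-use step in this example, the snippet below shows one plausible way to decode a PowerShell -EncodedCommand payload (base64 over UTF-16LE text) before handing the plaintext to a model for analysis. The sample payload and the prompt are illustrative only and are not output from the SecLM platform.

```python
import base64

def decode_powershell_encoded_command(encoded: str) -> str:
    """Decode a PowerShell -EncodedCommand payload (base64 of UTF-16LE text)."""
    return base64.b64decode(encoded).decode("utf-16-le")

# Illustrative payload: base64 of the UTF-16LE string "Write-Output 'hello'"
encoded = base64.b64encode("Write-Output 'hello'".encode("utf-16-le")).decode("ascii")
script = decode_powershell_encoded_command(encoded)

# The decoded plaintext, plus an instruction, would then be handed to the
# security-tuned model for behavioral analysis (placeholder prompt below).
prompt = f"Explain what this PowerShell script does and whether it looks malicious:\n{script}"
print(prompt)
```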
In this section, we have seen how a holistic approach that combines large language models
(LLMs) and authoritative data sources with a flexible planning framework can help security
practitioners by gathering, aggregating, and intelligently processing security data. We have
also seen how SecLM and its supporting infrastructure are being built to provide a one-stop
security platform for experts, junior analysts, and systems administrators. These advances,
combined with human expertise, can transform the practice of security, delivering superior results with less toil for the people who practice it.
Medical question-answering (QA) has always been a grand challenge in artificial intelligence
(AI). The vast and ever-evolving nature of medical knowledge, combined with the need for
accurate and nuanced reasoning, has made it difficult for AI systems to achieve human-level
performance on medical QA tasks.
However, large language models (LLMs) trained on massive datasets of text have shown
promising results on a variety of medical QA benchmarks. LLMs are able to understand and
apply complex medical concepts in a way that was not possible for previous generations of
AI systems.
In addition, the increasing availability of medical data and the growing field of medical NLP
have created new opportunities for innovation in medical QA. Researchers are now able to
develop systems that can answer medical questions from a variety of sources, including
medical textbooks, research papers, and patient records.
This combination of technical capabilities and data availability provides the groundwork for
models like Med-PaLM, an LLM aligned and fine-tuned from the PaLM family of models.
The development of Med-PaLM is only the start of a journey with the goal of improving health
outcomes by making the technology available to researchers, clinicians, and other users.
The opportunities
GenAI has the potential to fundamentally transform the medical field, in both diagnostic and non-diagnostic aspects, in numerous ways. For example:
• Empowering users to ask questions in the context of the medical history in their health record, such as “What are good weekend activities for me to consider, given the surgery I underwent two weeks ago?”
• Enhancing the patient intake process by moving beyond a fixed set of questions and
instead adapting based on the patient's responses. This allows for more efficient and
comprehensive data collection and provides a more cohesive summary to the clinical staff.
This list represents merely a small selection from a vast array of possibilities, many of which were previously considered unattainable with earlier technologies.
The field of medicine also serves as a use case with a strong culture and need for
responsible innovation. Medical applications are regulated due to the importance of patient
safety. While GenAI systems can be used to develop new diagnostic tools, treatment plans,
and educational materials, it is important to validate the safety and efficacy of such systems
before their implementation in clinical practice. This means that scientific experimentation
requires a thoughtful, phased approach with retrospective studies (i.e., using de-identified
data from past cases so that research does not impact patient care) happening before
prospective studies (i.e., running the model on newly collected data in a specific setting of
interest, sometimes interventionally so that impact on patient care can be measured).
Many AI systems developed for medicine today lack the ability to interact with users and instead produce structured outputs, such as “yes” or “no”, or a numerical value. While this type of output is useful to clinicians in many scenarios, it is inflexible. A separate model also needs to be created for every application, which slows down innovation.
In our view,5 medicine revolves around caring for people, and needs to be human-centric. As
such, an ambitious goal would be a flexible AI system that can interact with people and assist
in many different scenarios while taking into account the appropriate context. To
create such a system, it is essential to incorporate a wide range of experiences, perspectives,
and expertise when building AI systems. Data and algorithms should go hand in hand with
language and interaction, empathy, and compassion.
The objective behind this project is to enhance the effectiveness, helpfulness, and safety of AI models in medicine by incorporating natural language and facilitating interactivity for and between clinicians, researchers, and patients. To bring this vision to life, we took the initial steps with Med-PaLM, described below.
The recent progress in foundation models,6 such as LLMs, which are large pre-trained AI systems that can be easily adapted for various domains and tasks, presents an opportunity to rethink the development and use of AI in medicine on a broader scale. These expressive
and interactive models hold significant potential to make medical AI more performant, safe,
accessible, and equitable by flexibly encoding, integrating, and interpreting medical data
at scale.
• Our first version of Med-PaLM, described in a preprint in late 2022 and published in Nature in July 2023,7 was the first AI system to exceed the passing mark on US Medical Licensing Examination (USMLE)-style questions.8 The study also evaluated long-form answers and described a comprehensive evaluation framework.
• Med-PaLM 2,9 its successor, went on to reach expert-level performance on USMLE-style questions, scoring 86.5% on the MedQA benchmark (see Figure 5).
These advances reflect our belief that innovation can take major strides in a short period of
time, and be done responsibly and with rigor.
Figure 4 shows an example of a USMLE-style question. Individuals taking the test are given
a concise patient profile that includes information such as their symptoms and prescribed
medications. A medical question is presented based on the provided scenario, and test-
takers are required to choose the correct response from multiple choices.
Correctly answering the question requires the individual taking the test to comprehend
symptoms, interpret a patient’s test results, engage in intricate reasoning regarding the
probable diagnosis, and ultimately select the correct choice for the most suitable disease,
test, or treatment combination. In summary, a combination of medical comprehension, knowledge retrieval, and reasoning is vital for success. It takes years of
education and training for clinicians to develop the knowledge needed to consistently answer
these questions accurately.
As every clinician will attest, merely passing the USMLE does not indicate proficiency in diagnosing or managing patients clinically. Instead, the USMLE is a specific assessment of knowledge and reasoning based on concrete scenarios. Nevertheless, it serves as a useful benchmark, since the answer is typically documented and evaluation can be conducted programmatically at scale. This contributed to its historical popularity as a grand-challenge benchmark in scientific research, which makes it a powerful way to demonstrate how far the technology has advanced.
Figure 5. Med-PaLM 2 reached expert-level performance on the MedQA medical exam benchmark
Med-PaLM was the first AI model to exceed the passing mark, with a score of 67%, and Med-PaLM 2 was the first to reach expert-level performance, scoring 86.5% (Figure 5).
To evaluate the quality of long-form answers, expert raters consider questions along several axes, for example:
• How does the answer relate to the consensus in the scientific and clinical community?
• Does the answer contain information that is inapplicable or inaccurate for any particular medical demographic?
• How well does the answer address the intent of the question?
• How helpful is this answer to the user? Does it enable them to draw a conclusion or help
clarify next steps?
• Physicians and the model answer the same set of questions.
• Those answers are then presented in a blinded way (i.e., who provided each answer is not indicated) to separate raters.
It is important to emphasize that the evaluation primarily focuses on substance over style or delivery. In certain instances, a clinician’s response may be concise yet effectively meet the evaluation criteria, while in other scenarios a more detailed answer may be more appropriate.
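As a small illustration of how such blinded judgments might be recorded and aggregated, the sketch below computes a simple preference rate from side-by-side ratings. The field names and the example data are hypothetical, not the rubric used in the Med-PaLM studies.

```python
from dataclasses import dataclass

@dataclass
class SideBySideRating:
    question_id: str
    preferred: str          # "model" or "physician", recorded without revealing the source
    substance_score: int    # 1-5 Likert score for how well the preferred answer addresses the question

ratings = [
    SideBySideRating("q1", "model", 4),
    SideBySideRating("q2", "physician", 5),
    SideBySideRating("q3", "model", 4),
]

preference_rate = sum(r.preferred == "model" for r in ratings) / len(ratings)
print(f"Model preferred in {preference_rate:.0%} of blinded comparisons")
```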
Our human evaluation results as of May 2023 indicate that the answers provided by our models compare well to those from physicians across several clinically important axes.
Since conducting evaluations with scientific rigor requires the involvement of experts, such as board-certified physicians, the process is notably costlier than evaluating multiple-choice questions. It is promising to see that other studies10 have adopted and expanded upon the suggested framework, keeping results comparable and aligned with AI safety goals. Expert evaluation plays a critical role in discerning style (i.e., delivery) and content as well as correctness.
We also learned that more work remains, including improvements along specific evaluation
axes where physicians’ performance remained superior.
The detailed results are the cornerstone of understanding and identifying areas in need of
future scientific modeling and evaluation, as well as determining the feasibility of the next
step in our journey.
The integration of technology into the clinical environment is a well-established area, and
Google has gained its own expertise5 in the field through screening for diabetic retinopathy.
One of the main insights is that achieving high performance on retrospective datasets does not automatically translate into clinical performance. It is imperative to validate AI solutions carefully in real-world environments to ensure their robustness and reliability.
Each technology integrated into a patient’s journey, whether it falls under regulatory oversight or not, is encouraged to adhere to a sequence of scientific validation steps, moving from retrospective evaluation toward prospective studies in the setting of interest.
These steps are crucial not just for assessing the model's performance on new unseen data
but also, more significantly, for evaluating the effectiveness of the end-to-end system when
integrated into real workflows. Occasionally, the optimal way to use GenAI models like Med-
PaLM may diverge from initial assumptions, and introducing a new tool into a clinical workflow
might require unexpected adjustments to the overall process.11,12 End-to-end assessment is
essential for understanding the role and benefit of the technology and tailoring AI solutions
to meet the needs effectively.
Med-PaLM7 highlighted the significance and value of a specialized model for the medical
domain. Med-PaLM 2, an aligned and fine-tuned iteration of PaLM 2 tailored to medical
knowledge, achieves a ninefold reduction in inaccurate reasoning compared to the baseline.13 However, it's crucial to recognize that excelling in one medical domain task doesn't necessarily imply success in a different medical domain task. For instance,
does a great general medical QA system also perform well on a mental health assessment
task? While it's reasonable to assume that a demonstrated understanding of clinical
knowledge can generalize effectively to tasks heavily relying on this knowledge, each specific
task requires validation and possible adaptation, such as the measurement of psychiatric
functioning,14 before proceeding further.
The medical domain also extends well beyond textual information. The practice of medicine is
inherently multi-modal and incorporates information from images, electronic health records,
sensors, wearables, genomics, and more. Multimodal versions15 of MedLM and related
approaches16,17,18 are in early stages of research, and follow the same validation principles and
workflow integration approach. We expect to see multimodal use cases evaluated and deployed in the field.
Lastly, a medically specialized model can be applied not only to clinical use cases that relate
directly to patient care, but also to use cases that benefit from leveraging medical knowledge
in a flexible way. An example is in scientific discovery, where Med-PaLM can be used to
accurately identify genes associated with biomedical traits.19 We'll be exploring a breadth
of possibilities with vertical-specific models, and we expect new applications and ideas to
emerge in the field over the next few years. We’re also exploring safe and responsible ways
to bring these models to the healthcare industry. With MedLM, a suite of models fine-tuned
for healthcare use cases, built on Med-PaLM 2, we’re making solutions commercially available
so healthcare organizations can build GenAI use cases suitable for their workflows.
Med-PaLM 2 is an advancement of the base model PaLM 2, Google's enhanced LLM with substantial performance improvements on multiple LLM benchmark tasks. To tailor Med-
PaLM 2 for medical applications, instruction fine-tuning7 was performed using MultiMedQA,7
including MedQA, MedMCQA, HealthSearchQA, LiveQA, and MedicationQA datasets. Dataset
mixture ratios were empirically determined.
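As a rough illustration of instruction fine-tuning with a dataset mixture, the sketch below samples training examples from several QA sources according to fixed weights. The loaders and the mixture ratios are placeholders, not the actual MultiMedQA composition.

```python
import random

# Placeholder loaders: each entry stands in for a full (question, answer) dataset.
datasets = {
    "MedQA": [("q1", "a1")],
    "MedMCQA": [("q2", "a2")],
    "HealthSearchQA": [("q3", "a3")],
    "LiveQA": [("q4", "a4")],
    "MedicationQA": [("q5", "a5")],
}

# Illustrative mixture ratios; in practice these are determined empirically.
mixture = {"MedQA": 0.35, "MedMCQA": 0.25, "HealthSearchQA": 0.20,
           "LiveQA": 0.10, "MedicationQA": 0.10}

def sample_batch(batch_size: int) -> list[tuple[str, str]]:
    """Draw a fine-tuning batch whose composition follows the mixture ratios."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    sources = random.choices(names, weights=weights, k=batch_size)
    return [random.choice(datasets[source]) for source in sources]

print(sample_batch(4))
```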
Chain-of-thought prompting enables the language model to condition on its own intermediate outputs for multi-step problem-
solving. Self-consistency plays a role in enhancing the model's performance on multiple-
choice questions by sampling multiple explanations and answers from the model, with the
final answer determined by a majority vote among the generated options. These strategies
collectively improve the model's ability to reason and provide more accurate responses to
complex and multi-faceted queries.
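The following sketch illustrates the self-consistency idea described above: sample several chain-of-thought completions at a non-zero temperature and take a majority vote over the extracted answers. The sample_answer function is a placeholder for a call to the model; this is not Med-PaLM 2 code.

```python
import random
from collections import Counter

def sample_answer(question: str, options: list[str]) -> str:
    """Placeholder for one sampled chain-of-thought completion reduced to its final choice.
    A real implementation would call the LLM with temperature > 0 and parse the answer."""
    return random.choice(options)  # stand-in for a stochastic model response

def self_consistency(question: str, options: list[str], num_samples: int = 11) -> str:
    """Sample several reasoning paths and return the majority-vote answer."""
    votes = Counter(sample_answer(question, options) for _ in range(num_samples))
    return votes.most_common(1)[0][0]

print(self_consistency("Which of the following is the most likely diagnosis ...?", ["A", "B", "C", "D"]))
```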
Figure 7. Ensemble refinement (ER) in Med-PaLM 2. This approach involves conditioning an LLM on multiple potential reasoning pathways that it generates, facilitating answer refinement and improvement
The goal behind the inception of the Med-PaLM research effort was to improve health outcomes by using and advancing emerging AI technologies. Achieving expert-level
performance in medical QA tasks was the first step, with many more to follow in close
collaboration with the clinical community as we progress on this journey.
Our health research experience at Google demonstrated repeatedly that technology is often
not the sole challenge in applying AI productively to healthcare. Instead, many other factors,
including thoughtful evaluation strategies and working on clinically meaningful applications
in partnership with clinicians and a broad cross-functional team, are pivotal to success.5 This
valuable insight is likely applicable to other vertical domains as well.
As AI technology matures and moves closer to practical use cases and real-world scenarios,
careful multi-step evaluations, including both retrospective and prospective assessments,
are beneficial to better understand the real role and benefits of the technology in the whole
workflow. Guidance by a clinical partner improves the chances of building the right solution
for better health outcomes. Many promising applications lie in the collaboration of healthcare
workers and technology, combining the strengths of both. It is also important to use GenAI
systems in a way that is respectful of patients' autonomy and privacy.
For the foreseeable future, it is reasonable to assume that models customized for specific
applications or domains will yield better results, and we will be tracking trends and any convergence in performance between general and specialized models in the years ahead. For
Med-PaLM specifically, our research progress will be tracked at the Med-PaLM research
webpage.20 We aim to make progress more broadly in the field of using AI and GenAI for the
betterment of patients, clinicians, and researchers.
Summary
This whitepaper explores the potential of LLMs in tackling complex challenges within specific
domains, with a particular focus on healthcare and cybersecurity.
Endnotes
1. Cantos, J., et al., 2023. Threat Actors are Interested in Generative AI, but Use Remains Limited. [online]
Available at: https://cloud.google.com/blog/topics/threat-intelligence/threat-actors-generative-ai-limited/.
2. Lin, C.Y., et al., 2003. Automatic Evaluation of Summaries Using n-gram Co-occurrence Statistics. [online]
Available at: https://aclanthology.org/N03-1020.pdf.
3. Papineni, K., et al., 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. [online] Available
at: https://aclanthology.org/P02-1040.pdf.
4. Zhang, T., et al., 2019. BERTScore: Evaluating Text Generation with BERT. [online] Available at: https://openreview.net/attachment?id=SkeHuCVFDr&name=original_pdf.
5. Google, 2023. 5 myths about medical AI, debunked. [online] Available at: https://blog.google/technology/health/5-myths-about-medical-ai-debunked/.
6. Bommasani, R., et al., 2021. On the opportunities and risks of foundation models. arXiv preprint
arXiv:2108.07258. [online] Available at: https://arxiv.org/pdf/2108.07258.
7. Singhal, K., et al., 2023. Large language models encode clinical knowledge. Nature, 620(7972), pp.172-180.
[online] Available at: https://www.nature.com/articles/s41586-023-06291-2.
8. Jin, D., et al., 2021. What disease does this patient have? a large-scale open domain question answering
dataset from medical exams. Applied Sciences, 11(14), p.6421.
9. Singhal, K., et al., 2023. Towards expert-level medical question answering with large language models. arXiv
preprint arXiv:2305.09617. [online] Available at: https://arxiv.org/abs/2305.09617.
10. Bernstein, I.A., et al., 2023. Comparison of ophthalmologist and large language model chatbot responses
to online patient eye care questions. JAMA Network Open, 6(8), pp.e2330320-e2330320. [online] Available at:
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2808557.
11. Beede, E., et al., 2019. A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the
Detection of Diabetic Retinopathy. [online] Available at: https://dl.acm.org/doi/abs/10.1145/3313831.3376718.
12. Pedersen, S., et al., 2021. Redesigning Clinical Pathways for Immediate Diabetic Retinopathy Screening
Results. NEJM Catalyst, July. [online] Available at: https://catalyst.nejm.org/doi/pdf/10.1056/CAT.21.0096.
13. Google, 2023. Google I/O Keynote 2023. [online] Available at: https://www.youtube.com/live/cNfINi5CNbY?si=jQFi-Y3mG0rGD3Xd&t=810.
14. Galatzer-Levy, I.R., et al., 2023. The Capability of Large Language Models to Measure Psychiatric
Functioning. arXiv preprint arXiv:2308.01834. [online] Available at: https://arxiv.org/abs/2308.01834.
15. Tu, T., et al., 2023. Towards generalist biomedical AI. arXiv preprint arXiv:2307.14334. [online] Available at:
https://arxiv.org/abs/2307.14334.
16. Liu, X., et al., 2023. Large Language Models are Few-Shot Health Learners. arXiv:2305.15525. [online]
Available at: https://arxiv.org/abs/2305.15525.
17. Belyaeva, A., et al., 2023. Multimodal LLMs for health grounded in individual-specific data. arXiv:2307.09018.
[online] Available at: https://arxiv.org/abs/2307.09018.
18. Xu, S., et al., 2023. ELIXR: Towards a general purpose X-ray artificial intelligence system through
alignment of large language models and radiology vision encoders. arXiv:2308.01317. [online] Available at:
https://arxiv.org/abs/2308.01317.
19. Tu, T., et al., 2023. Genetic Discovery Enabled by a Large Language Model. [online] Available at: https://www.biorxiv.org/content/10.1101/2023.11.09.566468v1.full.pdf.