Assessing RAG Models For Health Chatbots
Table 1: List of models tested. “En” for English, “Hi” for Hindi, “Ka” for Kannada, “Ta” for Tamil, “Te” for Telugu, and “All” refers to all the aforementioned languages. All Indic models are open-weights.
that GPT-4 performs well, as the ground truth responses are based on responses generated by GPT-4. The Phi-3.5-MoE-instruct model also performs well on all metrics, followed by Mistral-Large-Instruct-2407 and open-aditi-hi-v4, which is the only Indic model that performs near the top even for English queries. Surprisingly, the Meta-Llama-3.1-70B-Instruct model performs worse than expected on this task, frequently regurgitating the entire prompt that was provided. In general, all models get higher scores on conciseness, and many models do well on coherence.

Figure 1: Percentage Agreement between human and LLM-evaluators for English. The red line indicates the average PA across models.

For the non-English queries, which are far fewer in number compared to English (Tables 3, 5, 6, 4 in Appendix A.1), we find that models such as Aya-23-35B perform near the top for Hindi along

⁴ The formulation and wording of the metric was slightly simplified so that the annotators could better understand it.

Table 2: Metric-wise scores for English (columns: Model, AGG, COH, CON, FC, SS). The Proprietary, Open-Weights, and Indic models are highlighted appropriately. All Indic models are open-weights.

4.3 Qualitative Analysis

One of the authors of the paper performed a qualitative analysis of responses from the evaluated LLMs on 100 selected patient questions. The questions were chosen to cover a range of medical topics and languages. Thematic analysis involved (1) initial familiarization with the queries and associated LLM responses, (2) theme identification, where five themes were generated, and (3) thematic coding, where the generated themes were applied to the 100 question-answer pairs. We briefly summarize these results below.

The five generated themes across queries were
(1) misspelling of English words, (2) code-mixing, (3) non-native English, (4) relevance to cultural context, and (5) specificity to the patient’s condition.

For queries that involve misspellings (such as “saving” and “sarjere” mentioned in Section 3.1), many of the evaluated LLMs were not able to come up with an appropriate response. For the query with the word “saving”, responses varied from “The patient should not be saved for more than 15 days after the surgery” to “Saving should not be done after surgery” to “You should not strain to pass motion for 15 days after the surgery. If you are constipated, it is recommended to consult the doctor”. All of these responses deviate from the GPT-4-generated GT, which said “You can have a shave after the cataract surgery. However, you should avoid having a head bath or shampoo for 15 days post-surgery.”

In cases of code-mixing and Indian English, LLMs were more robust in their responses than to misspellings. The term “Kanna operation” was well understood by most models, and Hinglish queries were responded to in a manner that mirrored the GT answer. However, less common terms were not understood when mixed with English. The query with the word “Karwat”, as mentioned in Section 3.1, received responses ranging from “you can start cooking after 1 week” to “I’m sorry, but I cannot provide an answer to your question. The information you are seeking is not relevant to cataract surgery or the guidelines for post-operative care” to “be careful when children get near you”. Most of the evaluated LLMs understood the use of “sugar” in reference to diabetes, as well as sentences following different syntax than would be common in native English.

The responses to culturally-relevant questions varied greatly between evaluated LLMs. For example, to the question on the appropriateness of chapati and puri on the day of surgery, some LLMs approved, saying “Yes, he can take chapati, Puri etc on the day of cataract surgery”, while others were against this, saying “You should have a light meal before the surgery. Avoid heavy or oily foods like chapati and Puri on the day of your cataract surgery. It’s best to stick to easily digestible foods. If you have any specific dietary concerns, please discuss them with your healthcare team”. Questions relating to returning to a “native place” were met with refusals by around half of the evaluated LLMs.

Questions that were specific to the patient’s condition were also responded to in a diverse manner by the evaluated LLMs. For example, the query “Can aztolet20 (atorvastatin and clopidogrel) tablet be taken post surgery” had the GT response “I do not know the answer to your question. If this needs to be answered by a doctor, please schedule a consultation”, as there was no mention of this medication in the KB. However, some LLMs approved its use, responding “Yes, you can take the aztolet20 (atorvastatin and clopidogrel) tablet post cataract surgery. However, it is important to follow your doctor’s prescription and instructions”, while others responded with mentions of medication that was referred to in the KB: “If you are referring to the IMOL tablet, it is usually taken when you experience pain. However, for other medications, please share the name so I can provide a more accurate answer. Always remember to follow your doctor’s prescription.” Around half refused to answer the question, mirroring the GT.

5 Discussion

In this study, we evaluated 24 models on healthcare-related queries in the RAG setting. Our findings revealed many insights, which we share below:

Difference in model scores We find that the models that we evaluate vary widely in their scores. This indicates that not all models are suitable for use in the healthcare setting, and we find that some models perform worse than expected. For example, GPT-4o and Meta-Llama-3.1-70B-Instruct perform worse than smaller models on this task.

English vs. Multilingual Queries Although the number of non-English queries is small, we find that some Indic models perform better on English queries than non-English queries. We also observe that the Factual Correctness score is lower for non-English queries than English queries on average, indicating that models find it difficult to answer non-English queries accurately. This may be due to the cultural and linguistic nuances present in our queries.

Multilingual vs. Indic models We evaluate several models that are specifically fine-tuned on Indic languages and on Indic data and observe that they do not always perform well on non-English queries. This could be because several instruction-tuned models are tuned on synthetic instruction data, which is usually a translation of English instruction data. A notable exception is the Aya-23-35B model, which contains manually created instruction tuning data for different languages and performs well for Hindi. Additionally, several multilingual instruction tuning datasets have short instructions, which may not be suitable for complex RAG settings, which typically have longer prompts and large chunks of data.

Human vs. LLM-based evaluation We conduct human evaluation on a subset of models and data points and observe strong alignment with the LLM-evaluator overall, especially regarding the final ranking of the models. However, for certain models like Mistral-Large-Instruct-2407 (for Telugu) and Meta-Llama-3.1-70B-Instruct (for other languages), the agreement is low. It is important to note that we use LLM-evaluators both with and without references, and assess human agreement for Semantic Similarity, which uses ground truth references. This suggests that LLM-evaluators should be used cautiously in a multilingual context, and we plan to broaden human evaluation to include more metrics in future work.

Evaluation in controlled settings with uncontaminated datasets We evaluate 24 models in an identical setting, leading to a fair comparison between models. Our dataset is curated based on questions from users of an application and is not contaminated in the training dataset of any of the models we evaluate, lending credibility to the results and insights we gather.

Locally-grounded, non-translated datasets Our dataset includes various instances of code-switching, Indian English colloquialisms, and culturally specific questions, which cannot be obtained by translating datasets, particularly with automated translations. While models were able to handle code-switching to a certain extent, responses varied greatly to culturally-relevant questions. This underscores the importance of collecting datasets from target populations while building models or systems for real-world use.
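The percentage agreement (PA) between human and LLM-evaluator ratings discussed above can be computed straightforwardly. The following is an illustrative sketch, not the authors' code; the function name and the flat per-datapoint score lists are our assumptions:

```python
# Illustrative sketch: percentage agreement (PA) between human and
# LLM-evaluator rubric scores (each score is a discrete 0/1/2 rating)
# for the same set of datapoints.

def percentage_agreement(human_scores, llm_scores):
    """Percent of datapoints where both evaluators assign the same score."""
    if len(human_scores) != len(llm_scores):
        raise ValueError("score lists must be aligned per datapoint")
    matches = sum(h == l for h, l in zip(human_scores, llm_scores))
    return 100.0 * matches / len(human_scores)

# Example: agreement on five hypothetical Semantic Similarity ratings.
pa = percentage_agreement([2, 1, 2, 0, 2], [2, 1, 1, 0, 2])
print(pa)  # → 80.0
```

Exact-match agreement is a strict criterion; chance-corrected statistics such as Cohen's kappa are a common alternative when score distributions are skewed.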
6 Limitations

Our work is subject to several limitations.

• Because our dataset is derived from actual users of a healthcare bot, we could not regulate the ratio of English to non-English queries. Consequently, the volume of non-English queries in our dataset is significantly lower than that of English queries, meaning the results on non-English queries should not be considered definitive. Similarly, since the HealthBot is available only in four Indian languages, we also could not evaluate on languages beyond these. The scope of our HealthBot setting is currently confined to queries from patients at one hospital in India, resulting in less varied data. We intend to expand this study as HealthBot extends its reach to other parts of the country.

• While we evaluated numerous models in this work, some were excluded from this study for various reasons, such as ease of access. We aim to incorporate more models in future research.

Institutional Review All aspects of this research were reviewed and approved by the Institutional Review Board of our organization and also approved by Karya.

Data Our study is conducted in collaboration with Karya, which pays workers several times the minimum wage in India and provides them with dignified digital work. Workers were paid 15 INR per datapoint for this study. Each datapoint took approximately 4 minutes to evaluate.

Annotator Demographics All annotators were native speakers of the languages that they were evaluating. Other annotator demographics were not collected for this study.

Annotation Guidelines Karya provided annotation guidelines and training to all workers.

Compute/AI Resources All our experiments were conducted on 4 × A100 80GB PCIE GPUs. The API calls to the GPT models were done through the Azure OpenAI service. We also acknowledge the usage of ChatGPT and GitHub Copilot for building our codebase, and for refining the writing of the paper.
- You are an cataract chatbot whose primary goal is to help patients undergoing or undergone a cataract surgery.
- If the query can be truthfully and factually answered using the knowledge base only, answer it concisely in a polite and professional way. If not, then just say: ‘I do not know the answer to your question. If this needs to be answered by a doctor, please schedule a consultation.‘
- Incase of a conflict between raw knowledge base and new knowledge base, always prefer the new knowledge base, and the latest source in the new knowledge base. Note that, either the raw knowledge base or the new knowledge base can be empty.
- The provided query is in {query_lang}, and you must always respond in {response_lang}.
- Do not generate any other opening or closing statements or remarks.
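The prompt above is a template with {query_lang} and {response_lang} placeholders. A minimal sketch of how such a template might be instantiated and combined with retrieved knowledge-base chunks in a RAG pipeline follows; the function name, message layout, and abbreviated system text are our assumptions, not the paper's implementation:

```python
# Hypothetical sketch of assembling a RAG chat request from the prompt
# template; the paper's actual pipeline code is not shown.

SYSTEM_TEMPLATE = (
    "The provided query is in {query_lang}, and you must always respond "
    "in {response_lang}."
)  # abbreviated: the full instruction list in the paper precedes this line


def build_messages(query, kb_chunks, query_lang, response_lang):
    """Fill the language placeholders and attach retrieved KB passages."""
    system = SYSTEM_TEMPLATE.format(
        query_lang=query_lang, response_lang=response_lang
    )
    context = "\n\n".join(kb_chunks)  # retrieved knowledge-base passages
    user = f"Knowledge base:\n{context}\n\nQuery: {query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


msgs = build_messages(
    "Can I wash my hair?",
    ["Avoid having a head bath or shampoo for 15 days post-surgery."],
    "English",
    "English",
)
```

The resulting `msgs` list matches the messages format expected by OpenAI-style chat completion APIs.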
“name": “Coherence",
“description": "Coherence assesses the logical flow of the response, ensuring that one idea leads smoothly to the next. A coherent response should
present information in a structured manner, making it easy for the reader to follow the thought process without confusion.",
"scoring": {
"0": {
"(a)": "The response is highly disorganized and lacks a clear structure, making it difficult to follow.",
"(b)": "Sentences or ideas appear out of order or are disconnected, resulting in a confusing or jarring reading experience.",
"(c)": "The overall message is unclear due to poor organization."
},
"1": {
"(a)": "The response has some structure but includes noticeable breaks in the logical flow.",
"(b)": "Transitions between ideas may be abrupt, or there may be gaps in the reasoning, forcing the reader to make extra effort
to follow along.",
"(c)": "While the main point is evident, the flow is inconsistent."
},
"2": {
"(a)": "The response is well-organized and flows logically from one idea to the next.",
"(b)": "Each point builds naturally on the previous one, creating a clear and cohesive narrative.",
"(c)": "The reader can easily follow the thought process without having to backtrack or piece together disjointed information."
}
}
"name": "Conciseness",
"scoring": {
    "0": {
        "(a)": "The response is overly verbose, including repeated information, irrelevant details, or excessive explanations.",
        "(b)": "It takes far longer than necessary to convey the intended message, making it inefficient and difficult to read."
    },
    "1": {
        "(a)": "The response is somewhat concise but includes some unnecessary information or redundant points.",
        "(b)": "While the main message is clear, the response could be made more efficient by removing repetition or streamlining explanations."
    },
    "2": {
        "(a)": "The response is highly concise, delivering all relevant information in a brief and efficient manner.",
        "(b)": "There is no repetition, and every sentence serves a clear purpose.",
        "(c)": "The message is conveyed succinctly, without sacrificing clarity or detail."
    }
}
"name": "Factual Correctness",
"scoring": {
    "0": {
        "(a)": "The response contains one or more significant factual errors.",
        "(b)": "Key facts, numbers, or data points are incorrect, misleading, or fabricated, and the response does not align with the ground-truth or the knowledge base.",
        "(c)": "The factual inaccuracies could lead to misunderstandings or incorrect conclusions."
    },
    "1": {
        "(a)": "The response is partially accurate but contains minor factual inaccuracies or omissions.",
        "(b)": "While the majority of facts are correct, some important details may be misstated or missing.",
        "(c)": "The response captures the general truth but lacks precision or completeness in key factual areas."
    },
    "2": {
        "(a)": "The response is factually accurate, with all critical facts, figures, and details aligned with the ground-truth answer and knowledge base.",
        "(b)": "There are no factual errors, and the information is presented with precision and correctness, making the response highly reliable."
    }
}
"name": "Semantic Similarity",
"scoring": {
    "0": {
        "(a)": "The prediction does not align with the ground truth in terms of key facts, numbers, or critical phrases.",
        "(b)": "The core meaning of the prediction diverges entirely from the ground-truth.",
        "(c)": "The differences would lead to misunderstandings or incorrect conclusions about the core message."
    },
    "1": {
        "(a)": "The prediction contains some similarities to the ground truth, with some key facts, numbers, and phrases being correctly aligned.",
        "(b)": "However, the prediction is missing some information or contains some added information.",
        "(c)": "This causes the prediction to fail at encapsulating the entire core meaning present in the ground truth."
    },
    "2": {
        "(a)": "The prediction is semantically similar to the ground-truth, with key facts, numbers, and phrases correctly aligned.",
        "(b)": "Any differences are minor and do not significantly alter the core meaning or factual accuracy.",
        "(c)": "The essential message of the prediction matches that of the ground-truth."
    }
}
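Rubrics like the ones above can be serialized directly into an LLM-evaluator prompt for direct assessment on the 0–2 scale. The following sketch is illustrative only; the evaluator prompt wording and abbreviated rubric are our assumptions, not the paper's prompts:

```python
import json

# Hypothetical sketch: embedding a metric rubric (as in the JSON above)
# into an LLM-evaluator prompt for direct assessment on a 0-2 scale.
rubric = {
    "name": "Conciseness",  # abbreviated rubric for illustration
    "scoring": {
        "0": "Overly verbose, with repeated or irrelevant information.",
        "1": "Somewhat concise but includes some redundancy.",
        "2": "Highly concise; every sentence serves a clear purpose.",
    },
}


def evaluator_prompt(rubric, query, response):
    """Render a direct-assessment prompt from a rubric dictionary."""
    return (
        f"Rate the response on {rubric['name']} using this rubric:\n"
        f"{json.dumps(rubric['scoring'], indent=2)}\n\n"
        f"Query: {query}\nResponse: {response}\n"
        "Output only the integer score (0, 1, or 2)."
    )


p = evaluator_prompt(rubric, "When can I bathe?", "After 15 days.")
```

Reference-based metrics such as Semantic Similarity would additionally include the ground-truth answer in the rendered prompt.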
Table 7: Human and LLM rankings according to the direct assessment. The value in brackets denotes the average score of the Semantic Similarity metric, which was used for the evaluation.