
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings

Varun Gumma♠   Anandhita Raghunath*♢   Mohit Jain†♠   Sunayana Sitaram†♠
♠ Microsoft Corporation   ♢ University of Washington
[email protected], [email protected]

* Work done during an internship at Microsoft
† Equal Advising

arXiv:2410.13671v1 [cs.CL] 17 Oct 2024

Abstract

Assessing the capabilities and limitations of large language models (LLMs) has garnered significant interest, yet the evaluation of multiple models in real-world scenarios remains rare. Multilingual evaluation often relies on translated benchmarks, which typically do not capture linguistic and cultural nuances present in the source language. This study provides an extensive assessment of 24 LLMs on real-world data collected from Indian patients interacting with a medical chatbot in Indian English and 4 other Indic languages. We employ a uniform Retrieval Augmented Generation framework to generate responses, which are evaluated using both automated techniques and human evaluators on four specific metrics relevant to our application. We find that models vary significantly in their performance and that instruction-tuned Indic models do not always perform well on Indic language queries. Further, we empirically show that factual correctness is generally lower for responses to Indic queries compared to English queries. Finally, our qualitative work shows that code-mixed and culturally relevant queries in our dataset pose challenges to evaluated models.

1 Introduction

Large Language Models (LLMs) have demonstrated impressive proficiency across various domains. Nonetheless, their full spectrum of capabilities and limitations remains unclear, resulting in unpredictable performance on certain tasks. Additionally, there is now a wide selection of LLMs available. Therefore, evaluation has become crucial for comprehending the internal mechanisms of LLMs and for comparing them against each other. Despite the importance of evaluation, significant challenges still persist. Many widely-used benchmarks for assessing LLMs are contaminated (Ahuja et al., 2024; Oren et al., 2024; Xu et al., 2024), meaning that they often appear in LLM training data. Some of these benchmarks were originally created for conventional Natural Language Processing tasks and may not fully represent current practical applications of LLMs (Conneau et al., 2018; Pan et al., 2017). Recently, there has been growing interest in assessing LLMs within multilingual and multicultural contexts (Ahuja et al., 2023, 2024; Faisal et al., 2024; Watts et al., 2024; Chiu et al., 2024). Traditionally, these benchmarks were developed by translating English versions into various languages. However, due to the loss of linguistic and cultural context during translation, new benchmarks specific to different languages and cultures are now being created. However, such benchmarks are few in number, and several of the older ones are contaminated in training data (Ahuja et al., 2024; Oren et al., 2024). Thus, there is a need for new benchmarks that can test the abilities of models in real-world multilingual settings.

LLMs are employed in various fields, including critical areas like healthcare. Jin et al. (2024) translate an English healthcare dataset into Spanish, Chinese, and Hindi, and demonstrate that performance declines in these languages compared to English. This highlights the necessity of examining LLMs more thoroughly in multilingual contexts for these important uses.

In this study, we conduct the first comprehensive assessment of multilingual models within a real-world healthcare context. We evaluate responses from 24 multilingual and Indic models using 750 questions posed by users of a health chatbot in five languages (Indian English and four Indic languages). All the models being evaluated function within the same RAG framework, and their outputs are compared to doctor-verified ground truth responses. We evaluate LLM responses on four metrics curated for our application, including factual correctness, semantic similarity, coherence,
and conciseness, and present leaderboards for each metric, as well as an overall leaderboard. We use human evaluation and automated methods (LLM-as-a-judge) to compute these metrics by comparing LLM responses with ground-truth reference responses or assessing the responses in a reference-free manner.

Our results suggest that models vary significantly in their performance, with some smaller models outperforming larger ones. Factual Correctness is generally lower for non-English queries compared to English queries. We observe that instruction-tuned Indic models do not always perform well on Indic language queries. Our dataset contains several instances of code-mixed and culturally-relevant queries, which models sometimes struggle to answer. The contributions of our work are as follows:

• We evaluate 24 models (proprietary as well as open weights) in a healthcare setting using queries provided by patients using a medical chatbot. This guarantees that our dataset is not contaminated in the training data of any of the models we evaluate.

• We curate a dataset of queries from multilingual users that spans multiple languages. The queries feature language typical of multilingual communities, such as code-switching, which is rarely found in translated datasets, making ours a more realistic dataset for model evaluation.

• We evaluate several models in an identical RAG setting, making it possible to compare models in a fair manner. The RAG setting is a popular configuration that numerous models are being deployed in for real-world applications.

• We establish relevant metrics for our application and determine an overall combined metric by consulting domain experts - doctors working on the medical chatbot project.

• We perform assessments (with and without ground truth references) using LLM-as-a-judge and conduct human evaluations on a subset of the models and data to confirm the validity of the LLM assessment.

2 Related Works

Healthcare Chatbots in India: Within the Indian context, the literature has documented great diversity in health seeking and health communication behaviors based on gender (Das et al., 2018), varying educational status, poor functional literacy, cultural context (Islary, 2018), stigmas (Wang et al.), etc. This diversity in behavior may translate to people's use of medical chatbots, which are increasingly reaching hundreds of Indian patients at the margins of the healthcare system (Mishra et al., 2023). These bots solicit personal health information directly from patients in their native Indic languages or in Indic English. For example, Ramjee et al. (2024) find that their CataractBot deployed in Bangalore, India yields patient questions on topics such as surgery, preoperative preparation, diet, exercise, discharge, medication, pain management, etc. Mishra et al. (2023) find that Indian people share “deeply personal questions and concerns about sexual and reproductive health” with their chatbot SnehAI. Yadav et al. (2019) find that queries to chatbots are “embedded deeply into a communities myths and existing belief systems”, while Xiao et al. (2023) note that patients have difficulties finding health information at an appropriate level for them to comprehend. Therefore, LLMs powering medical chatbots in India and other Low and Middle Income Countries are challenged to respond lucidly to medical questions that are asked in ways that may be hyperlocal to patient context. Few works have documented how LLMs react to this linguistic diversity in the medical domain. Our work begins to bridge this gap.

Multilingual and RAG evaluation: Several previous studies have conducted in-depth evaluation of the multilingual capabilities of LLMs by evaluating across standard tasks (Srivastava et al., 2022; Liang et al., 2023; Ahuja et al., 2023, 2024; Asai et al., 2024; Lai et al., 2023; Robinson et al., 2023), with a common finding that current LLMs only have a limited multilingual capacity. Other works (Watts et al., 2024; Leong et al., 2023) include evaluating LLMs on creative and generative tasks. Salemi and Zamani (2024) state that evaluating RAG models requires a joint evaluation of the retrieval and the generated output. Recent works such as Chen et al. (2024) and Chirkova et al. (2024) benchmark LLMs as RAG models in bilingual and multilingual setups. Lastly, several tools and benchmarks have also been built for automatic evaluation of RAG, even in medical domains (Es et al., 2024; Tang and Yang, 2024; Xiong et al., 2024a,b), and we refer the readers to Yu et al. (2024) for a comprehensive list and survey.

LLM-based Evaluators: With the advent of large-scale instruction following capabilities in LLMs, automatic evaluations with the help of these models are being preferred (Kim et al., 2024a,b; Liu et al., 2024; Shen et al., 2023; Kocmi and Federmann, 2023). However, it has been shown that it is optimal to assess these evaluations in tandem with human annotations, as LLMs can provide inflated scores (Hada et al., 2024b,a; Watts et al., 2024). Other works (Zheng et al., 2023; Watts et al., 2024) have employed GPT-4 alongside human evaluators to build leaderboards to assess other LLMs. Ning et al. (2024) proposed an innovative approach using LLMs for peer review, where models evaluate each other's outputs. However, a recent study by Doddapaneni et al. (2024) highlighted the limitations of LLM-based evaluators, revealing their inability to reliably detect subtle drops in input quality during evaluations, raising concerns about their precision and dependability for fine-grained assessments. In this work, we use LLM-based evaluators both with and without ground-truth references and also use human evaluation to validate LLM-based evaluation.

3 Methodology

In this study, we leveraged a dataset collected from a deployed medical chatbot. Here, we provide an overview of the question dataset, the knowledge base employed for answering those questions, the process for generating responses, and the evaluation framework.

3.1 Data

The real-world test data was collected by our collaborators as part of an ongoing research effort that designed and deployed a medical chatbot, hereafter referred to as HealthBot, to patients scheduled for cataract surgery at a large hospital in urban India. Ethics approval was obtained from our institution prior to conducting this work, and once enrolled in the study and consent was obtained, both the patient and their accompanying family member or attendant were instructed on how to use HealthBot on WhatsApp. Through this instructional phase, they were informed that questions could be asked by voice or by text, in one of 5 languages - English, Hindi, Kannada, Tamil, Telugu. The workflow of chatting with HealthBot was as follows: Patients sent questions through the WhatsApp interface to HealthBot. Their questions were transcribed automatically (using a speech recognition system) and translated (using an off-the-shelf translator) into English if needed, after which GPT-4 was used to produce an initial response by performing RAG on the documents in the knowledge base (KB, see below). This initial response was passed to doctors who reviewed, validated, and if needed, edited the answer. The doctor-approved answer is henceforth referred to as the ground truth (GT) response associated with the patient query.

Our evaluation dataset was curated from this data by including all questions sent to HealthBot along with their associated GT response. Exclusion criteria removed exact duplicate questions, those with personally identifying information, and those not relevant to health. Additionally, for this work, we only consider questions to which the GPT-4 answer was directly approved by the expert as the “correct and complete answer” without additional editing on the doctors' part. The final dataset contained 749 question and GT answer pairs that were sent in to HealthBot between December 2023 and June 2024. In the pool, 666 questions were in English, 19 in Hindi, 27 in Tamil, 14 in Telugu, and 23 in Kannada. Note that queries written in the script of a specific language were classified as belonging to that language. For code-mixed and Romanized queries, we determined whether they were English or non-English based on the matrix language of the query.

The evaluation dataset consists of queries that (1) have misspelled English words, (2) are code-mixed, (3) represent non-native English, (4) are relevant to the patient's cultural context and (5) are specific to the patient's condition. We provide some examples of each of these categories.

Examples of misspelled queries include questions such as “How long should saving not be done after surgery?” where the patient intended to ask about shaving, and “Sarjere is don mam?” which the attendant used to inquire about the patient's discharge status. Instances of code mixing can be seen in phrases like “Agar operation ke baad pain ho raha hai, to kya karna hai?” meaning “If there is pain after the surgery, what should I do?” in Hindi-English (Hinglish). Other examples include “Can I eat before the kanna operation?” where
“kanna” means eye in Tamil, and “kanna operation” is a well understood, common way of referring to cataract surgery, and “In how many days can a patient take Karwat?” where “Karwat” means turning over in sleep in Hindi.

Indian English was used in a majority of the English queries, making the phrasing of questions different from what they would be with native English speech. Examples are as follows - “Because I have diabetes sugar problem I am worried much”, “Why to eat light meal only? What comes under light meal?” and “Is the patient should be in dark room after surgery?” Taking a shower was commonly referred to as “taking a bath”, and eye glasses were commonly referred to as “goggles”, “spex” or “spectacles”.

Culturally-relevant questions were also many in number; for example, questions about specific foods were asked, like “Can he take chapati, Puri etc on the day of surgery?” and “Can I eat non veg after surgery?” (“non-veg” is a term used in Indian English to denote eating meat). Questions about yoga were asked, like “How long after the surgery should the Valsalva maneuver be avoided?” and “Are there any specific yoga poses I can do?”. The notion of a patient's native place or village was brought up in queries such as “If a person gets operated here and then goes to his native place and if some problem occurs what shall he do?” or “Can she travel by car with AC for 100 kms?”.

3.2 Knowledge Base

The documents populating the knowledge base (KB) were initially curated by doctors at the hospital where HealthBot was deployed. This consisted of 12 PDF documents that were converted into text files and manually error checked. The documents included Standard Operating Procedure manuals, standard treatment guidelines, consent forms, frequently-asked-question documents, insurance information, etc. Following this initial curation, doctors working with HealthBot were able to select question-answer pairs to be added to the KB after the bot was deployed. In this manner, the knowledge available to GPT-4 in the KB grew over time. Therefore, every question that was asked by patients was associated with a different version of the KB being used for answer generation. This detail was incorporated into our evaluation in order to compare the verified ground truth data with the generated response in an accurate manner. All KB documents were chunked to a maximum length of 1000 tokens and embedded in a VectorDB[1] using text-embedding-ada-002[2]. Subsequently, for each query, the top 3 most relevant chunks are extracted, and the models are queried with this data.

3.3 Models

We chose 24 models, including proprietary multilingual models as well as open-weights multilingual and Indic language models, for our evaluation. A full list of models can be found in Table 1.

3.4 Response Generation

We use the standard Retrieval-Augmented-Generation (RAG) strategy to elicit responses from all the models. Each model is asked to respond to the given query by extracting the appropriate pieces of text from the knowledge-base chunks. During prompting, we segregate the chunks into RawChunks and KBUpdateChunks, symbolizing the data from the standard sources and the KB updates respectively. The model is then explicitly instructed to prioritize the information from the latest sources, i.e. the KBUpdateChunks (if they are available). The exact prompt used for generation is provided in Appendix X. Note that each model gets the same RawChunks and KBUpdateChunks, which are also the same as those given to the GPT-4 model in HealthBot, based on which the GT responses are verified.

3.5 Response Evaluation

We used both human and automated evaluation to evaluate the performance of models in the setup described above. GPT-4o[3] was employed as an LLM evaluator. We prompted the model separately to judge each metric, as Hada et al. (2024b,a) show that individual calls reduce interaction and influence among the metrics and their evaluations.

3.5.1 LLM Evaluation

In consultation with domain experts working on HealthBot, we curated metrics that are relevant for our application. We limit ourselves to 3 classes (Good - 2, Medium - 1, Bad - 0) for each metric, as a larger number of classes could hurt interpretability and lower LLM-evaluator performance. The prompts used for each of our metrics are available in Appendix A.2, and a general overview is provided below.

[1] https://www.trychroma.com
[2] https://platform.openai.com/docs/guides/embeddings/embedding-models
[3] https://openai.com/index/hello-gpt-4o/
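The chunk-embed-retrieve step described in Section 3.2 (1000-token chunks, text-embedding-ada-002 embeddings in a Chroma vector store, top-3 retrieval per query) can be sketched as follows. This is a minimal illustration and not the authors' code: the whitespace-based chunker is a rough stand-in for a real tokenizer, the document strings are placeholders, and the function names are ours.

# Minimal sketch of the Section 3.2 retrieval pipeline (illustrative, not the authors' code).
# Assumes: `pip install chromadb openai` and OPENAI_API_KEY set in the environment.
import chromadb
from openai import OpenAI

openai_client = OpenAI()
MAX_TOKENS = 1000  # the paper chunks KB documents to at most 1000 tokens


def chunk(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    # Whitespace split approximates token counting; a real tokenizer would be used in practice.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def embed(texts: list[str]) -> list[list[float]]:
    resp = openai_client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [d.embedding for d in resp.data]


# Build the vector store from (placeholder) KB documents.
kb_documents = ["<SOP manual text>", "<treatment guideline text>"]  # hypothetical contents
chunks = [c for doc in kb_documents for c in chunk(doc)]

collection = chromadb.Client().create_collection(name="healthbot_kb")
collection.add(ids=[f"chunk-{i}" for i in range(len(chunks))],
               documents=chunks,
               embeddings=embed(chunks))

# Retrieve the top-3 chunks for a patient query, as in the paper's setup.
query = "Can I eat before the kanna operation?"
top3 = collection.query(query_embeddings=embed([query]), n_results=3)["documents"][0]
print(top3)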
Models    Languages Tested    Availability
GPT-4 All Proprietary
GPT-4o All Proprietary
microsoft/Phi-3.5-MoE-instruct All Open-weights
CohereForAI/c4ai-command-r-plus-08-2024 All Open-weights
Qwen/Qwen2.5-72B-Instruct All Open-weights
CohereForAI/aya-23-35B All Open-weights
mistralai/Mistral-Large-Instruct-2407 All Open-weights
google/gemma-2-27b-it All Open-weights
meta-llama/Meta-Llama-3.1-70B-Instruct All Open-weights
GenVRadmin/llama38bGenZ_Vikas-Merged All Indic
GenVRadmin/AryaBhatta-GemmaOrca-Merged All Indic
GenVRadmin/AryaBhatta-GemmaUltra-Merged All Indic
GenVRadmin/AryaBhatta-GemmaGenZ-Vikas-Merged All Indic
Telugu-LLM-Labs/Indic-gemma-7b-finetuned-sft-Navarasa-2.0 All Indic
ai4bharat/Airavata En, Hi Indic
Cognitive-Lab/LLama3-Gaja-Hindi-8B-v0.1 En, Hi Indic
BhabhaAI/Gajendra-v0.1 En, Hi Indic
manishiitg/open-aditi-hi-v4 En, Hi Indic
abhinand/tamil-llama-7b-instruct-v0.2 En, Ta Indic
abhinand/telugu-llama-7b-instruct-v0.1 En, Te Indic
Telugu-LLM-Labs/Telugu-Llama2-7B-v0-Instruct En, Te Indic
Tensoic/Kan-Llama-7B-SFT-v0.5 En, Ka Indic
Cognitive-Lab/Ambari-7B-Instruct-v0.2 En, Ka Indic
GenVRadmin/Llamavaad En, Hi Indic

Table 1: List of models tested. “En” for English, “Hi” for Hindi, “Ka” for Kannada, “Ta” for Tamil, “Te” for Telugu,
and “All” refers to all the aforementioned languages. All Indic models are open-weights.

• Factual Correctness (FC): As Doddapaneni et al. (2024) have shown that LLM-based evaluators fail to identify subtle factual inaccuracies, we curate a separate metric to double-check facts like dates, numbers, procedure and medicine names.

• Semantic Similarity (SS): Similarly, we formulate another metric to specifically analyse if both the prediction and the ground-truth response convey the same information semantically, especially when they are in different languages.

• Coherence (COH): This metric evaluates if the model was able to stitch together appropriate pieces of information from the three data chunks provided to yield a coherent response.

• Conciseness (CON): Since the knowledge base chunks extracted and provided to the model can be quite large, with important facts embedded at different positions, we build this metric to assess the ability of the model to extract and compress all these bits of information relevant to the query into a crisp response.

Among the metrics presented above, Factual Correctness and Semantic Similarity use the GT response verified by doctors as a reference, while Coherence and Conciseness are reference-free metrics. In order to arrive at a combined score for each model, we asked two doctors who collaborate on HealthBot to assign weights to the first four metrics according to their importance, and used an average of the percentages for each metric as the final coefficient to compute the Aggregate (AGG). Both doctors gave the maximum weight to Factual Correctness followed by Semantic Similarity, while Coherence and Conciseness were given lower and equal weightage.

3.5.2 Human Evaluation

Following previous works (Hada et al., 2024b,a; Watts et al., 2024), we augment the LLM evaluation with human evaluation and draw correlations between the LLM evaluator and human evaluation for a subset of the models (Phi-3.5-MoE-instruct, Mistral-Large-Instruct-2407, GPT-4o, Meta-Llama-3.1-70B-Instruct, Indic-gemma-7b-finetuned-sft-Navarasa-2.0). These models were selected based on results from early automated evaluations, covering a range of scores and representing models of interest.

The human annotators were employed by Karya, a data annotation company, and were all native speakers of the Indian languages that we evaluated. We selected a sample of 100 queries from English, and all the queries from Indic languages, for annotation, yielding a total of 183 queries. Each instance was annotated by one annotator for Semantic Similarity between the model's response and the GT response provided by the doctor. The annotations began with a briefing about the task; each annotator was given a sample test task and provided some guidance based on their difficulties and mistakes. Finally, the annotators were asked to evaluate the model response based on the metric[4], query, and ground-truth response on a scale of 0 to 2, similar to the LLM-evaluator.

4 Results

In this section, we present the outcomes of both the LLM and human evaluations. We begin by examining the average scores across all our metrics, including the combined metric, for English queries, followed by results for queries in other languages. Next, we examine the ranking of models based on scores given by human annotators and compare these with rankings based on scores provided by the LLM evaluator. Lastly, we conduct a qualitative analysis of the outcomes and describe noteworthy findings.

4.1 LLM evaluator results

We see from Table 2 that for English, the best performing model is Qwen2.5-72B-Instruct across all metrics. Note that it is expected that GPT-4 performs well, as the ground truth responses are based on responses generated by GPT-4. The Phi-3.5-MoE-instruct model also performs well on all metrics, followed by Mistral-Large-Instruct-2407 and open-aditi-hi-v4, which is the only Indic model that performs near the top even for English queries. Surprisingly, the Meta-Llama-3.1-70B-Instruct model performs worse than expected on this task, frequently regurgitating the entire prompt that was provided. In general, all models get higher scores on conciseness and many models do well on coherence.

For the non-English queries, which are far fewer in number compared to English (Tables 3, 5, 6, 4 in Appendix A.1), we find that models such as aya-23-35B perform near the top for Hindi along with proprietary and large open-weights models such as Qwen2.5-72B-Instruct and Mistral-Large-Instruct-2407, outperforming many of the fine-tuned Indic LLMs. The gemma-2-27b-it model also outperforms many Indic models in the Indic setting, compared to its performance in English. This shows that some instruction-tuned Indic LLMs may not perform well in the RAG setting. We also find that, compared to English, models get lower values on FC on Indic queries, which is concerning as it is rated as the most important metric by doctors.

4.2 Comparison of human and LLM evaluators

We perform human evaluation on five models on the Semantic Similarity (SS) task and compare human and LLM evaluation by inspecting the ranking of the models in Appendix A.3. We find that for all languages except Telugu, we get identical rankings of all models. Additionally, we also measure the Percentage Agreement (PA) between the human and LLM-evaluator, details of which can be found in Appendix A.1, and find it to be consistently higher than 0.7 on average across all languages and models. This shows the reliability of our LLM-based evaluation for Semantic Similarity, which uses the GT response as a reference.

[Figure 1: Percentage Agreement between human and LLM-evaluators for English, shown per model (gpt-4o, Mistral-Large-Instruct-2407, Indic-gemma-7b-finetuned-sft-Navarasa-2.0, Phi-3.5-MoE-instruct, Meta-Llama-3.1-70B-Instruct). The red line indicates the average PA across models.]

4.3 Qualitative Analysis

One of the authors of the paper performed a qualitative analysis of responses from the evaluated LLMs on 100 selected patient questions. The questions were chosen to cover a range of medical topics and languages. Thematic analysis involved (1) initial familiarization with the queries and associated LLM responses, (2) theme identification, where 5 themes were generated, and (3) thematic coding, where the generated themes were applied to the 100 question-answer pairs. We briefly summarize these results below.

[4] The formulation and wording of the metric was slightly simplified for the annotators to better understand it.
Model AGG COH CON FC SS

Qwen2.5-72B-Instruct 1.46 1.86 1.96 1.62 1.43
GPT-4 1.40 1.71 1.95 1.56 1.36
Phi-3.5-MoE-instruct 1.29 1.65 1.93 1.43 1.22
Mistral-Large-Instruct-2407 1.29 1.60 1.95 1.42 1.24
open-aditi-hi-v4 1.27 1.69 1.85 1.37 1.22
Llamavaad 1.16 1.34 0.97 1.36 1.20
AryaBhatta-GemmaGenZ-Vikas-Merged 1.12 1.48 1.65 1.22 1.07
Kan-Llama-7B-SFT-v0.5 1.01 1.39 1.64 1.07 0.97
gemma-2-27b-it 1.00 1.28 1.88 1.07 0.91
AryaBhatta-GemmaOrca-Merged 0.97 1.32 1.62 1.03 0.92
LLama3-Gaja-Hindi-8B-v0.1 0.91 0.63 1.65 1.09 0.98
GPT-4o 0.91 1.08 1.78 0.98 0.87
aya-23-35B 0.91 1.09 1.65 1.00 0.83
Gajendra-v0.1 0.88 1.21 1.38 0.93 0.85
c4ai-command-r-plus-08-2024 0.82 1.15 1.48 0.85 0.74
tamil-llama-7b-instruct-v0.2 0.81 1.13 1.50 0.83 0.75
Airavata 0.80 1.03 1.38 0.85 0.78
Ambari-7B-Instruct-v0.2 0.73 0.86 1.11 0.76 0.82
Meta-Llama-3.1-70B-Instruct 0.65 0.55 1.12 0.77 0.67
Telugu-Llama2-7B-v0-Instruct 0.51 0.60 1.12 0.53 0.53
llama38bGenZ_Vikas-Merged 0.51 0.52 1.09 0.55 0.53
Indic-gemma-7b-finetuned-sft-Navarasa-2.0 0.35 0.32 0.53 0.40 0.39
AryaBhatta-GemmaUltra-Merged 0.32 0.38 1.19 0.31 0.27
telugu-llama-7b-instruct-v0.1 0.04 0.00 0.58 0.03 0.00

Table 2: Metric-wise scores for English. The Proprietary, Open-Weights and Indic models are highlighted
appropriately. All Indic models are open-weights.
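The per-metric scores in Table 2 come from prompting GPT-4o once per metric, with a 0-2 scale and a JSON verdict containing a Score and a Justification, as described in Section 3.5.1 and the evaluation prompt in the appendix. The sketch below is a hedged approximation of such a call: the rubric wording paraphrases the paper's setup rather than reproducing its exact prompt, and the judge_metric helper name is ours.

# Hedged sketch of a reference-based LLM-as-judge call (one metric per call, 0/1/2 scale).
# The rubric text paraphrases Section 3.5.1; it is not the paper's exact evaluation prompt.
import json
from openai import OpenAI

client = OpenAI()


def judge_metric(metric_name: str, rubric: str, query: str, response: str,
                 reference: str | None, chunks: list[str]) -> dict:
    user_payload = {
        "metric": {"name": metric_name, "rubric": rubric,
                   "scale": "0 (Bad), 1 (Medium), 2 (Good)"},
        "query": query,
        "model_response": response,
        "ground_truth_reference": reference,   # None for reference-free metrics (COH, CON)
        "knowledge_base_chunks": chunks,
    }
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},  # request a JSON dict with Score and Justification
        messages=[
            {"role": "system", "content": "You are an unbiased evaluator. Use only the provided "
                                          "information and return JSON with keys 'Score' and 'Justification'."},
            {"role": "user", "content": json.dumps(user_payload)},
        ],
    )
    return json.loads(completion.choices[0].message.content)


verdict = judge_metric(
    metric_name="Factual Correctness",
    rubric="Check dates, numbers, procedure and medicine names against the reference and chunks.",
    query="How long should shaving not be done after surgery?",
    response="You can shave after the surgery; avoid a head bath for 15 days.",
    reference="You can have a shave after the cataract surgery. Avoid a head bath for 15 days.",
    chunks=["<top-3 retrieved KB chunks>"],
)
print(verdict.get("Score"), verdict.get("Justification"))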

The five generated themes across queries were (1) misspelling of English words, (2) code-mixing, (3) non-native English, (4) relevance to cultural context and (5) specificity to the patient's condition.

For queries that involve misspellings (such as “saving” and “sarjere” mentioned in Section 3.1), many evaluated LLMs were not able to come up with an appropriate response. For the query with the word “saving”, responses varied from “The patient should not be saved for more than 15 days after the surgery” to “Saving should not be done after surgery” to “You should not strain to pass motion for 15 days after the surgery. If you are constipated, it is recommended to consult the doctor”. All of these responses deviate from the GPT-4 generated GT, which said “You can have a shave after the cataract surgery. However, you should avoid having a head bath or shampoo for 15 days post-surgery.”

In cases of code mixing and Indian English, LLMs were more robust in their responses than to misspellings. The term “Kanna operation” was well understood by most models, and Hinglish queries were responded to in a manner that mirrored the GT answer. However, less common terms were not understood when mixed with English. The query with the word “Karwat” as mentioned in 3.1 received responses ranging from “you can start cooking after 1 week” to “I'm sorry, but I cannot provide an answer to your question. The information you are seeking is not relevant to cataract surgery or the guidelines for post-operative care” to “be careful when children get near you”. Most of the evaluated LLMs understood the use of “sugar” in reference to diabetes, as well as sentences following different syntax than would be common in native English.

The responses for culturally-relevant questions varied greatly between evaluated LLMs. For example, to the question on the appropriateness of chapati and puri on the day of surgery, some LLMs approved, saying “Yes, he can take chapati, Puri etc on the day of cataract surgery” while others were against this, saying “You should have a light meal before the surgery. Avoid heavy or oily foods like chapati and Puri on the day of your cataract surgery. It's best to stick to easily digestible foods. If you have any specific dietary concerns, please discuss them with your healthcare team”. Questions relating to returning to a “native place” were met with refusals by around half of the evaluated LLMs.

Questions that were specific to the patient's condition were also responded to in a diverse manner by the evaluated LLMs. For example, the query “Can aztolet20 (atorvastatin and clopidogrel) tablet be taken post surgery” had the GT response “I do not know the answer to your question. If this needs to be answered by a doctor, please schedule a consultation” as there was no mention of this medication in the KB. However, some LLMs approved its use, responding “Yes, you can take the aztolet20 (atorvastatin and clopidogrel) tablet post cataract surgery. However, it is important to follow your doctor's prescription and instructions” while others responded with mentions of medication that was referred to in the KB, “If you are referring to the IMOL tablet, it is usually taken when you experience pain. However, for other medications, please share the name so I can provide a more accurate answer. Always remember to follow your doctor's prescription.” Around half refused to answer the question, mirroring the GT.

5 Discussion

In this study, we evaluated 24 models on healthcare-related queries in the RAG setting. Our findings revealed many insights, which we share below.

Difference in model scores: We find that the models that we evaluate vary widely in their scores. This indicates that not all models are suitable for use in the healthcare setting, and we find that some models perform worse than expected. For example, GPT-4o and Meta-Llama-3.1-70B-Instruct perform worse than smaller models on this task.

English vs. Multilingual Queries: Although the number of non-English queries is small, we find that some Indic models perform better on English queries than non-English queries. We also observe that the Factual Correctness score is lower for non-English queries than English queries on average, indicating that models find it difficult to answer non-English queries accurately. This may be due to the cultural and linguistic nuances present in our queries.

Multilingual vs. Indic models: We evaluate several models that are specifically fine-tuned on Indic languages and on Indic data and observe that they do not always perform well on non-English queries. This could be because several instruction-tuned models are tuned on synthetic instruction data, which is usually a translation of English instruction data. A notable exception is the aya-23-35B model, which contains manually created instruction tuning data for different languages and performs well for Hindi. Additionally, several multilingual instruction tuning datasets have short instructions, which may not be suitable for complex RAG settings, which typically have longer prompts and large chunks of data.

Human vs. LLM-based evaluation: We conduct human evaluation on a subset of models and data points and observe strong alignment with the LLM evaluator overall, especially regarding the final ranking of the models. However, for certain models like Mistral-Large-Instruct-2407 (for Telugu) and Meta-Llama-3.1-70B-Instruct (for other languages), the agreement is low. It is important to note that we use LLM-evaluators both with and without references, and assess human agreement for Semantic Similarity, which uses ground truth references. This suggests that LLM-evaluators should be used cautiously in a multilingual context, and we plan to broaden human evaluation to include more metrics in future work.

Evaluation in controlled settings with uncontaminated datasets: We evaluate 24 models in an identical setting, leading to a fair comparison between models. Our dataset is curated based on questions from users of an application and is not contaminated in the training dataset of any of the models we evaluate, lending credibility to the results and insights we gather.

Locally-grounded, non-translated datasets: Our dataset includes various instances of code-switching, Indian English colloquialisms, and culturally specific questions which cannot be obtained by translating datasets, particularly with automated translations. While models were able to handle code-switching to a certain extent, responses varied greatly to culturally-relevant questions. This underscores the importance of collecting datasets from target populations while building models or systems for real-world use.
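The human-LLM agreement discussed above (Section 4.2 and Figure 1) is reported as Percentage Agreement, i.e. the fraction of items on which the human annotator and the LLM evaluator assign the same 0/1/2 label. A minimal sketch of that computation follows; the label lists are made up for illustration and are not the paper's data.

# Percentage Agreement (PA) between human and LLM-evaluator labels on the 0/1/2 scale.

def percentage_agreement(human: list[int], llm: list[int]) -> float:
    if len(human) != len(llm) or not human:
        raise ValueError("Label lists must be non-empty and of equal length")
    return sum(h == l for h, l in zip(human, llm)) / len(human)


human_scores = [2, 1, 2, 0, 2, 1, 2, 2]   # hypothetical Semantic Similarity labels
llm_scores   = [2, 1, 1, 0, 2, 2, 2, 2]
print(f"PA = {percentage_agreement(human_scores, llm_scores):.2f}")  # 0.75 for this toy example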
6 Limitations

Our work is subject to several limitations.

• Because our dataset is derived from actual users of a healthcare bot, we could not regulate the ratio of English to non-English queries. Consequently, the volume of non-English queries in our dataset is significantly lower than that of English queries, meaning the results on non-English queries should not be considered definitive. Similarly, since HealthBot is available only in four Indian languages, we also could not evaluate on languages beyond these. The scope of our HealthBot setting is currently confined to queries from patients at one hospital in India, resulting in less varied data. We intend to expand this study as HealthBot extends its reach to other parts of the country.

• While we evaluated numerous models in this work, some were excluded from this study for various reasons, such as ease of access. We aim to incorporate more models in future research.

• Research has indicated that LLM-based evaluators tend to prefer their own responses. In our evaluations, we use GPT-4o, and there may be a bias leading to higher scores for the GPT-4o model and other models within the GPT family. Although not investigated in prior research, it is also conceivable that models fine-tuned with synthetic data generated by GPT-4o might receive elevated scores. We urge readers to keep these in mind while interpreting the scores. In future work, we plan to use multiple LLM-evaluators to obtain more robust results.

• Finally, our human evaluation was limited to a subset of models and data, and a single metric, due to time and budget constraints. In future work, we plan to incorporate more human evaluation, as well as qualitative analysis of the results.

7 Ethical Considerations

We use the framework by Bender and Friedman (2018) to discuss the ethical considerations for our work.

Institutional Review: All aspects of this research were reviewed and approved by the Institutional Review Board of our organization and also approved by Karya.

Data: Our study is conducted in collaboration with Karya, which pays workers several times the minimum wage in India and provides them with dignified digital work. Workers were paid 15 INR per datapoint for this study. Each datapoint took approximately 4 minutes to evaluate.

Annotator Demographics: All annotators were native speakers of the languages that they were evaluating. Other annotator demographics were not collected for this study.

Annotation Guidelines: Karya provided annotation guidelines and training to all workers.

Compute/AI Resources: All our experiments were conducted on 4 × A100 80GB PCIe GPUs. The API calls to the GPT models were done through the Azure OpenAI service. We also acknowledge the usage of ChatGPT and GitHub Copilot for building our codebase, and for refining the writing of the paper.

8 Acknowledgements

We thank Aditya Yadavalli, Vivek Seshadri, the Operations team and Annotators from Karya for the streamlined annotation process. We also extend our gratitude to Bhuvan Sachdeva for helping us with the HealthBot deployment, data collection and organization process.

References

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023. MEGA: Multilingual evaluation of generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore. Association for Computational Linguistics.

Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2024. MEGAVERSE: Benchmarking large language models across languages, modalities, models and tasks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume
1: Long Papers), pages 2598–2637, Mexico City, System Demonstrations, pages 150–158, St. Julians,
Mexico. Association for Computational Linguistics. Malta. Association for Computational Linguistics.
Akari Asai, Sneha Kudugunta, Xinyan Yu, Terra Fahim Faisal, Orevaoghene Ahia, Aarohi Srivastava,
Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Kabir Ahuja, David Chiang, Yulia Tsvetkov, and
Sebastian Ruder, and Hannaneh Hajishirzi. 2024. Antonios Anastasopoulos. 2024. Dialectbench: A
BUFFET: Benchmarking large language models for nlp benchmark for dialects, varieties, and closely-
few-shot cross-lingual transfer. In Proceedings of related languages. arXiv preprint arXiv:2403.11009.
the 2024 Conference of the North American Chap-
ter of the Association for Computational Linguistics: Rishav Hada, Varun Gumma, Mohamed Ahmed, Ka-
Human Language Technologies (Volume 1: Long lika Bali, and Sunayana Sitaram. 2024a. METAL:
Papers), pages 1771–1800, Mexico City, Mexico. As- Towards multilingual meta-evaluation. In Findings
sociation for Computational Linguistics. of the Association for Computational Linguistics:
NAACL 2024, pages 2280–2298, Mexico City, Mex-
Emily M. Bender and Batya Friedman. 2018. Data ico. Association for Computational Linguistics.
statements for natural language processing: Toward
mitigating system bias and enabling better science. Rishav Hada, Varun Gumma, Adrian Wynter, Harshita
Transactions of the Association for Computational Diddee, Mohamed Ahmed, Monojit Choudhury, Ka-
Linguistics, 6:587–604. lika Bali, and Sunayana Sitaram. 2024b. Are large
language model-based evaluators the solution to scal-
Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. ing up multilingual evaluation? In Findings of the
2024. Benchmarking large language models in Association for Computational Linguistics: EACL
retrieval-augmented generation. Proceedings of 2024, pages 1051–1070, St. Julian’s, Malta. Associa-
the AAAI Conference on Artificial Intelligence, tion for Computational Linguistics.
38(16):17754–17762.
Jacob Islary. 2018. Health and health seeking behaviour
Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault among tribal communities in india: A socio-cultural
Formal, Stéphane Clinchant, and Vassilina Nikoulina. perspective.
2024. Retrieval-augmented generation in multi-
lingual settings. In Proceedings of the 1st Work- Yiqiao Jin, Mohit Chandra, Gaurav Verma, Yibo Hu,
shop on Towards Knowledgeable Language Models Munmun De Choudhury, and Srijan Kumar. 2024.
(KnowLLM 2024), pages 177–188, Bangkok, Thai- Better to ask in english: Cross-lingual evaluation
land. Association for Computational Linguistics. of large language models for healthcare queries. In
Proceedings of the ACM on Web Conference 2024,
Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, pages 2627–2638.
Chan Young Park, Shuyue Stella Li, Sahithya Ravi,
Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang,
Vered Shwartz, et al. 2024. Culturalbench: a robust, Shayne Longpre, Hwaran Lee, Sangdoo Yun,
diverse and challenging benchmark on measuring the Seongjin Shin, Sungdong Kim, James Thorne, and
(lack of) cultural knowledge of llms. arXiv preprint Minjoon Seo. 2024a. Prometheus: Inducing fine-
arXiv:2410.02677. grained evaluation capability in language models. In
The Twelfth International Conference on Learning
Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad- Representations.
ina Williams, Samuel Bowman, Holger Schwenk,
and Veselin Stoyanov. 2018. Xnli: Evaluating cross- Seungone Kim, Juyoung Suk, Shayne Longpre,
lingual sentence representations. In Proceedings of Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham
the 2018 Conference on Empirical Methods in Natu- Neubig, Moontae Lee, Kyungjae Lee, and Minjoon
ral Language Processing, pages 2475–2485. Seo. 2024b. Prometheus 2: An open source language
model specialized in evaluating other language mod-
Moumita Das, Federica Angeli, Anja J. S. M. Krumeich, els. Preprint, arXiv:2405.01535.
and Onno C. P. van Schayck. 2018. The gendered
experience with respect to health-seeking behaviour Tom Kocmi and Christian Federmann. 2023. Large lan-
in an urban slum of kolkata, india - international guage models are state-of-the-art evaluators of trans-
journal for equity in health. lation quality. In Proceedings of the 24th Annual
Conference of the European Association for Machine
Sumanth Doddapaneni, Mohammed Safi Ur Rahman Translation, pages 193–203, Tampere, Finland. Euro-
Khan, Sshubam Verma, and Mitesh M. Khapra. pean Association for Machine Translation.
2024. Finding blind spots in evaluator llms with
interpretable checklists. arXiv preprint arXiv: Viet Lai, Nghia Ngo, Amir Pouran Ben Veyseh, Hieu
2406.13439. Man, Franck Dernoncourt, Trung Bui, and Thien
Nguyen. 2023. ChatGPT beyond English: Towards
Shahul Es, Jithin James, Luis Espinosa Anke, and a comprehensive evaluation of large language mod-
Steven Schockaert. 2024. RAGAs: Automated evalu- els in multilingual learning. In Findings of the As-
ation of retrieval augmented generation. In Proceed- sociation for Computational Linguistics: EMNLP
ings of the 18th Conference of the European Chap- 2023, pages 13171–13189, Singapore. Association
ter of the Association for Computational Linguistics: for Computational Linguistics.
Wei Qi Leong, Jian Gang Ngui, Yosephine Su- Translation, pages 392–418, Singapore. Association
santo, Hamsawardhini Rengarajan, Kengatharaiyer for Computational Linguistics.
Sarveswaran, and William Chandra Tjhi. 2023.
Bhasa: A holistic southeast asian linguistic and Alireza Salemi and Hamed Zamani. 2024. Evaluating
cultural evaluation suite for large language models. retrieval quality in retrieval-augmented generation.
Preprint, arXiv:2309.06085. In Proceedings of the 47th International ACM SI-
GIR Conference on Research and Development in
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Information Retrieval, SIGIR ’24, page 2395–2400,
Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian New York, NY, USA. Association for Computing
Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Ku- Machinery.
mar, Benjamin Newman, Binhang Yuan, Bobby Yan,
Ce Zhang, Christian Alexander Cosgrove, Christo- Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang
pher D Manning, Christopher Re, Diana Acosta- You, and Lidong Bing. 2023. Large language mod-
Navas, Drew Arad Hudson, Eric Zelikman, Esin els are not yet human-level evaluators for abstrac-
Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, tive summarization. In Findings of the Association
Huaxiu Yao, Jue WANG, Keshav Santhanam, Laurel for Computational Linguistics: EMNLP 2023, pages
Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, 4215–4233, Singapore. Association for Computa-
Nathan Kim, Neel Guha, Niladri S. Chatterji, Omar tional Linguistics.
Khattab, Peter Henderson, Qian Huang, Ryan An-
drew Chi, Sang Michael Xie, Shibani Santurkar, Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao,
Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Abu Awal Md Shoeb, Abubakar Abid, Adam
Tianyi Zhang, Vishrav Chaudhary, William Wang, Fisch, Adam R. Brown, Adam Santoro, Aditya
Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Ko- Gupta, Adrià Garriga-Alonso, Agnieszka Kluska,
reeda. 2023. Holistic evaluation of language models. Aitor Lewkowycz, Akshat Agarwal, Alethea Power,
Transactions on Machine Learning Research. Fea- Alex Ray, Alex Warstadt, Alexander W. Kocurek,
tured Certification, Expert Certification. Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Par-
rish, Allen Nie, Aman Hussain, Amanda Askell,
Yang Liu, Meng Xu, Shuo Wang, Liner Yang, Haoyu Amanda Dsouza, Ambrose Slone, Ameet Rahane,
Wang, Zhenghao Liu, Cunliang Kong, Yun Chen, Anantharaman S. Iyer, Anders Andreassen, Andrea
Yang Liu, Maosong Sun, and Erhong Yang. 2024. Madotto, Andrea Santilli, Andreas Stuhlmüller, An-
Omgeval: An open multilingual generative evalua- drew Dai, Andrew La, Andrew Lampinen, Andy
tion benchmark for large language models. Preprint, Zou, Angela Jiang, Angelica Chen, Anh Vuong,
arXiv:2402.13524. Animesh Gupta, Anna Gottardi, Antonio Norelli,
Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabas-
Ritwik Mishra, Rajiv Ratan Shah, Pushpendra Singh, sum, Arul Menezes, Arun Kirubarajan, Asher Mul-
Jasmeet Kaur, and Simranjeet Singh. 2023. [link]. lokandov, Ashish Sabharwal, Austin Herrick, Avia
Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts,
Kun-Peng Ning, Shuo Yang, Yu-Yang Liu, Jia-Yu Yao,
Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski,
Zhen-Hui Liu, Yu Wang, Ming Pang, and Li Yuan.
Batuhan Özyurt, Behnam Hedayatnia, Behnam
2024. Pico: Peer review in llms based on the
Neyshabur, Benjamin Inden, Benno Stein, Berk
consistency optimization. arXiv preprint arXiv:
Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan
2402.01830.
Orinion, Cameron Diao, Cameron Dour, Cather-
Yonatan Oren, Nicole Meister, Niladri S. Chatterji, ine Stinson, Cedrick Argueta, César Ferri Ramírez,
Faisal Ladhak, and Tatsunori Hashimoto. 2024. Prov- Chandan Singh, Charles Rathkopf, Chenlin Meng,
ing test set contamination in black-box language Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris
models. In The Twelfth International Conference Waites, Christian Voigt, Christopher D. Manning,
on Learning Representations. Christopher Potts, Cindy Ramirez, Clara E. Rivera,
Clemencia Siro, Colin Raffel, Courtney Ashcraft,
Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Noth- Cristina Garbacea, Damien Sileo, Dan Garrette, Dan
man, Kevin Knight, and Heng Ji. 2017. Cross-lingual Hendrycks, Dan Kilman, Dan Roth, Daniel Free-
name tagging and linking for 282 languages. In Pro- man, Daniel Khashabi, Daniel Levy, Daniel Moseguí
ceedings of the 55th annual meeting of the associa- González, Danielle Perszyk, Danny Hernandez,
tion for computational linguistics (volume 1: long Danqi Chen, Daphne Ippolito, Dar Gilboa, David Do-
papers), pages 1946–1958. han, David Drakard, David Jurgens, Debajyoti Datta,
Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz
Pragnya Ramjee, Bhuvan Sachdeva, Satvik Golechha, Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes,
Shreyas Kulkarni, Geeta Fulari, Kaushik Murali, and Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo,
Mohit Jain. 2024. Cataractbot: An llm-powered Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina
expert-in-the-loop chatbot for cataract patients. Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor
Hagerman, Elizabeth Barnes, Elizabeth Donoway, El-
Nathaniel Robinson, Perez Ogayo, David R. Mortensen, lie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu,
and Graham Neubig. 2023. ChatGPT MT: Competi- Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi,
tive for high- (but not low-) resource languages. In Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice En-
Proceedings of the Eighth Conference on Machine gefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia,
Fatemeh Siar, Fernando Martínez-Plumed, Francesca Novak, Roman Sitelew, Ronan LeBras, Rosanne
Happé, Francois Chollet, Frieda Rong, Gaurav Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhut-
Mishra, Genta Indra Winata, Gerard de Melo, Ger- dinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan
mán Kruszewski, Giambattista Parascandolo, Gior- Teehan, Rylan Yang, Sahib Singh, Saif M. Moham-
gio Mariani, Gloria Wang, Gonzalo Jaimovitch- mad, Sajant Anand, Sam Dillavou, Sam Shleifer,
López, Gregor Betz, Guy Gur-Ari, Hana Galijase- Sam Wiseman, Samuel Gruetter, Samuel R. Bow-
vic, Hannah Kim, Hannah Rashkin, Hannaneh Ha- man, Samuel S. Schoenholz, Sanghyun Han, San-
jishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, jeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan
Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Ghosh, Sean Casey, Sebastian Bischoff, Sebastian
Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Gehrmann, Sebastian Schuster, Sepideh Sadeghi,
Jack Geissinger, Jackson Kernion, Jacob Hilton, Jae- Shadi Hamdan, Sharon Zhou, Shashank Srivastava,
hoon Lee, Jaime Fernández Fisac, James B. Simon, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixi-
James Koppel, James Zheng, James Zou, Jan Kocoń, ang Shane Gu, Shubh Pachchigar, Shubham Tosh-
Jana Thompson, Janelle Wingfield, Jared Kaplan, niwal, Shyam Upadhyay, Shyamolima, Debnath,
Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Siamak Shakeri, Simon Thormeyer, Simone Melzi,
Jason Wei, Jason Yosinski, Jekaterina Novikova, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee,
Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Spencer Torene, Sriharsha Hatwar, Stanislas De-
Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Ji- haene, Stefan Divic, Stefano Ermon, Stella Bider-
aming Song, Jillian Tang, Joan Waweru, John Bur- man, Stephanie Lin, Stephen Prasad, Steven T. Pi-
den, John Miller, John U. Balis, Jonathan Batchelder, antadosi, Stuart M. Shieber, Summer Misherghi, Svet-
Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose lana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal
Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto,
Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Te-Lin Wu, Théo Desbordes, Theodore Rothschild,
Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo
Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Schick, Timofei Kornev, Titus Tunduny, Tobias Ger-
Katja Markert, Kaustubh D. Dhole, Kevin Gim- stenberg, Trenton Chang, Trishala Neeraj, Tushar
pel, Kevin Omondi, Kory Mathewson, Kristen Chi- Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera
afullo, Ksenia Shkaruta, Kumar Shridhar, Kyle Mc- Demberg, Victoria Nyamai, Vikas Raunak, Vinay
Donell, Kyle Richardson, Laria Reynolds, Leo Gao, Ramasesh, Vinay Uday Prabhu, Vishakh Padmaku-
Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras- mar, Vivek Srikumar, William Fedus, William Saun-
Ochando, Louis-Philippe Morency, Luca Moschella, ders, William Zhang, Wout Vossen, Xiang Ren, Xi-
Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng aoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen,
He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song,
Şenel, Maarten Bosma, Maarten Sap, Maartje ter Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding
Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang
Mazeika, Marco Baturan, Marco Marelli, Marco Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian
Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. 2022.
Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Yixuan Tang and Yi Yang. 2024. Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. Preprint, arXiv:2401.15391.

Hua Wang, Sneha Gupta, Arvind Singhal, Poonam Muttreja, Sanghamitra Singh, Poorva Sharma, and Alice Piterova. An artificial intelligence chatbot for young people's sexual and reproductive health in india (snehai): Instrumental case study.

Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, and Sunayana Sitaram. 2024. Pariksha: A large-scale investigation of human-llm evaluator agreement on multilingual and multi-cultural data. arXiv preprint arXiv:2406.15053.

Ziang Xiao, Q. Vera Liao, Michelle Zhou, Tyrone Grandison, and Yunyao Li. 2023. Powering an ai chatbot with expert sourcing to support credible health information access.

Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. 2024a. Benchmarking retrieval-augmented generation for medicine. In Findings of the Association for Computational Linguistics ACL 2024, pages 6233–6251, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.

Guangzhi Xiong, Qiao Jin, Xiao Wang, Minjia Zhang, Zhiyong Lu, and Aidong Zhang. 2024b. Improving retrieval-augmented generation in medicine with iterative follow-up questions. arXiv preprint arXiv:2408.00727.

Cheng Xu, Shuhao Guan, Derek Greene, and M-Tahar Kechadi. 2024. Benchmark data contamination of large language models: A survey. arXiv preprint arXiv:2406.04244.

Deepika Yadav, Prerna Malik, Kirti Dabas, and Pushpendra Singh. 2019. Feedpal: Understanding opportunities for chatbots in breastfeeding education of women in india.

Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. 2024. Evaluation of retrieval-augmented generation: A survey. arXiv preprint arXiv:2405.07437.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

[Figure 2 (bar chart, image not reproduced): one panel per language — Hindi, Kannada, Tamil, Telugu — with bars for gpt-4o, Mistral-Large-Instruct-2407, Indic-gemma-7b-finetuned-sft-Navarasa-2.0, Phi-3.5-MoE-instruct, and Meta-Llama-3.1-70B-Instruct; x-axis: Percentage Agreement (0.0–1.0).]

Figure 2: Percentage agreement between human and LLM-evaluators for Indic languages

- You are an cataract chatbot whose primary goal is to help patients undergoing or undergone a cataract surgery.
- If the query can be truthfully and factually answered using the knowledge base only, answer it concisely in a polite and professional way. If not, then just say: ‘I do not know the answer to your question. If this needs to be answered by a doctor, please schedule a consultation.’
- Incase of a conflict between raw knowledge base and new knowledge base, always prefer the new knowledge base, and the latest source in the new knowledge base. Note that, either the raw knowledge base or the new knowledge base can be empty.
- The provided query is in {query_lang}, and you must always respond in {response_lang}.
- Do not generate any other opening or closing statements or remarks.

Figure 3: System Prompt for Generation
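For illustration, the sketch below shows how the Figure 3 template might be instantiated and sent to a model in a standard RAG loop. The OpenAI-style client, the model name, the chunk formatting, and the example query and knowledge-base snippet are assumptions made here for readability; they are not the system's actual implementation.

```python
# Minimal sketch (illustrative only) of filling the Figure 3 system prompt and
# calling a chat model with retrieved knowledge-base chunks.
from openai import OpenAI

SYSTEM_PROMPT = (
    "- You are an cataract chatbot whose primary goal is to help patients "
    "undergoing or undergone a cataract surgery.\n"
    "...\n"  # remaining instructions from Figure 3, truncated here
    "- The provided query is in {query_lang}, and you must always respond in {response_lang}.\n"
    "- Do not generate any other opening or closing statements or remarks."
)

def build_messages(query, raw_chunks, new_chunks, query_lang, response_lang):
    # Fill the language placeholders and pack the two knowledge bases plus the
    # query into a single user turn (the exact packing is an assumption).
    system = SYSTEM_PROMPT.format(query_lang=query_lang, response_lang=response_lang)
    user = (
        "Raw knowledge base:\n" + "\n".join(raw_chunks)
        + "\n\nNew knowledge base:\n" + "\n".join(new_chunks)
        + "\n\nQuery: " + query
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
messages = build_messages(
    query="Can I wash my face the day after cataract surgery?",
    raw_chunks=["Post-operative care: avoid splashing water into the operated eye ..."],  # illustrative chunk
    new_chunks=[],
    query_lang="English",
    response_lang="English",
)
reply = client.chat.completions.create(model="gpt-4o", messages=messages)
print(reply.choices[0].message.content)
```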
A Appendix

A.1 LLM-evaluator scores for non-English languages

A.2 Prompts

A.3 Comparison of human and LLM-evaluator ranking

- You are a helpful, unbiased **evaluator** that judges the quality of the response generated by the model given a query, relevant knowledge base chunks, ground-truth reference, and a metric to evaluate the response. Note that, not all the information will be provided to you in every case, and you must evaluate the response based only on the information provided to you.
- The metric will be always provided to you in a **JSON** format, and you have to use that metric to evaluate the response. You **MUST NOT** change or digress from the metric provided to you.
- In each case, you **MUST ALWAYS** prioritize the knowledge from the new/updated knowledge base over the raw knowledge base.
- **IF** a reference ground truth is provided, you **MUST** take it as the most optimal response and evaluate the response based on the metric provided to you.
- In all cases, the knowledge base will serve as the **ONLY** knowledge source for you to generate the response, and you **MUST NEVER** use any of your internal knowledge to evaluate the response for factuality and information retrieval.
- Your output **MUST** be a **JSON** dictionary with the following keys:
  – Score: The score of the response based on the metric provided to you. The score should be an integer value from 0 to 2, as mentioned in the metric.
  – Justification: A brief justification (in English) of the score you have assigned the response. Your justification **MUST** always reference the relevant pieces from the answer, query, and knowledge base chunks for interpretability.

Figure 4: System prompt for evaluation
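For illustration, the sketch below shows how the Figure 4 prompt and one metric rubric might be sent to an LLM evaluator and how the required Score/Justification output could be parsed. The evaluator model, the message layout, and the field names in the user payload are assumptions; only the output keys and the 0–2 integer scale come from the prompt above.

```python
# Minimal sketch (illustrative only) of calling an LLM evaluator with the
# Figure 4 system prompt plus one metric rubric, and parsing its JSON verdict.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def evaluate(system_prompt, metric, query, response, chunks, reference=None):
    # Field names in this payload are assumptions; the prompt only requires
    # that the metric arrive as JSON and that the output be a JSON dict.
    user = json.dumps({
        "metric": metric,                        # rubric JSON, e.g. Coherence (Figure 5)
        "query": query,
        "response": response,
        "knowledge_base_chunks": chunks,
        "ground_truth_reference": reference,     # may be None
    }, ensure_ascii=False)
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user}],
        response_format={"type": "json_object"},
    )
    verdict = json.loads(out.choices[0].message.content)
    assert verdict["Score"] in (0, 1, 2)         # integer scale required by the prompt
    return verdict["Score"], verdict["Justification"]
```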


Model                                        AGG    COH    CON    FC     SS
GPT-4                                        1.21   1.74   1.79   1.26   1.16
Qwen2.5-72B-Instruct                         1.20   1.89   1.95   1.21   1.11
Mistral-Large-Instruct-2407                  1.18   1.53   1.79   1.26   1.16
gemma-2-27b-it                               0.93   1.11   1.89   1.05   0.79
Aya-23-35B                                   0.92   0.95   1.79   1.05   0.84
AryaBhatta-GemmaGenZ-Vikas-Merged            0.89   1.11   1.32   1.00   0.84
Phi-3.5-MoE-instruct                         0.81   1.11   1.74   0.79   0.79
GPT-4o                                       0.76   0.74   1.79   0.84   0.74
AryaBhatta-GemmaOrca-Merged                  0.64   1.00   1.21   0.58   0.68
Airavata                                     0.63   0.84   1.26   0.68   0.53
Llama3-Gaja-Hindi-8B-v0.1                    0.60   0.79   1.26   0.63   0.53
open-aditi-hi-v4                             0.56   0.89   1.00   0.47   0.63
Llamavaad                                    0.55   0.47   0.21   0.68   0.63
c4ai-command-r-plus-08-2024                  0.52   0.95   1.47   0.47   0.37
Meta-Llama-3.1-70B-Instruct                  0.48   0.47   1.16   0.53   0.47
Gajendra-v0.1                                0.38   0.47   0.68   0.37   0.42
Llama38bGenZ_Vikas-Merged                    0.32   0.21   1.00   0.32   0.37
AryaBhatta-GemmaUltra-Merged                 0.31   0.37   1.00   0.32   0.26
Indic-gemma-7b-finetuned-sft-Navarasa-2.0    0.24   0.11   0.53   0.26   0.32

Table 3: Metric-wise scores for Hindi

Model                                        AGG    COH    CON    FC     SS
Qwen2.5-72B-Instruct                         1.29   1.87   1.96   1.35   1.22
GPT-4                                        1.18   1.78   1.96   1.30   0.91
Mistral-Large-Instruct-2407                  1.09   1.39   1.96   1.22   0.96
gemma-2-27b-it                               0.92   1.30   1.91   1.04   0.65
GPT-4o                                       0.88   0.96   2.00   1.00   0.74
AryaBhatta-GemmaOrca-Merged                  0.51   0.57   1.13   0.52   0.52
Meta-Llama-3.1-70B-Instruct                  0.48   0.43   0.78   0.57   0.48
Kan-Llama-7B-SFT-v0.5                        0.47   0.52   1.04   0.48   0.48
Llama38bGenZ_Vikas-Merged                    0.47   0.52   1.00   0.43   0.57
Indic-gemma-7b-finetuned-sft-Navarasa-2.0    0.24   0.35   0.39   0.26   0.22
Phi-3.5-MoE-instruct                         0.20   0.26   1.22   0.17   0.09
AryaBhatta-GemmaUltra-Merged                 0.13   0.17   0.70   0.09   0.13
Ambari-7B-Instruct-v0.2                      0.05   0.04   0.13   0.04   0.09

Table 4: Metric-wise scores for Kannada


Model                                        AGG    COH    CON    FC     SS
Qwen2.5-72B-Instruct                         1.29   1.87   1.96   1.35   1.22
GPT-4                                        1.18   1.78   1.96   1.30   0.91
Mistral-Large-Instruct-2407                  1.09   1.39   1.96   1.22   0.96
gemma-2-27b-it                               0.92   1.30   1.91   1.04   0.65
GPT-4o                                       0.88   0.96   2.00   1.00   0.74
AryaBhatta-GemmaOrca-Merged                  0.51   0.57   1.13   0.52   0.52
Meta-Llama-3.1-70B-Instruct                  0.48   0.43   0.78   0.57   0.48
Kan-Llama-7B-SFT-v0.5                        0.47   0.52   1.04   0.48   0.48
Llama38bGenZ_Vikas-Merged                    0.47   0.52   1.00   0.43   0.57
Indic-gemma-7b-finetuned-sft-Navarasa-2.0    0.24   0.35   0.39   0.26   0.22
Phi-3.5-MoE-instruct                         0.20   0.26   1.22   0.17   0.09
AryaBhatta-GemmaUltra-Merged                 0.13   0.17   0.70   0.09   0.13
Ambari-7B-Instruct-v0.2                      0.05   0.04   0.13   0.04   0.09

Table 5: Metric-wise scores for Tamil

Model                                        AGG    COH    CON    FC     SS
GPT-4                                        1.14   1.64   2.00   1.29   0.86
Qwen2.5-72B-Instruct                         1.11   1.57   1.71   1.29   0.86
Mistral-Large-Instruct-2407                  1.03   1.36   2.00   1.14   0.86
gemma-2-27b-it                               0.91   1.21   2.00   1.00   0.71
Meta-Llama-3.1-70B-Instruct                  0.61   0.43   1.00   0.79   0.57
GPT-4o                                       0.54   0.57   1.86   0.57   0.43
Phi-3.5-MoE-instruct                         0.44   0.57   1.86   0.43   0.29
Llama38bGenZ_Vikas-Merged                    0.33   0.14   1.50   0.36   0.29
AryaBhatta-GemmaOrca-Merged                  0.29   0.29   0.93   0.29   0.29
AryaBhatta-GemmaUltra-Merged                 0.26   0.29   1.71   0.21   0.14
Indic-gemma-7b-finetuned-sft-Navarasa-2.0    0.19   0.29   0.57   0.21   0.07
telugu-llama-7b-instruct-v0.1                0.09   0.00   1.71   0.00   0.00
Telugu-Llama2-7B-v0-Instruct                 0.00   0.00   0.00   0.00   0.00

Table 6: Metric-wise scores for Telugu
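For illustration, the sketch below shows one way per-query 0–2 judgments could be averaged into per-model, per-metric values of the kind reported in Tables 3–6. Treating the reported numbers as simple means over queries is an assumption made only for this sketch; the AGG column's aggregation rule is not reproduced here.

```python
# Illustrative only: average per-query 0-2 scores into per-model, per-metric
# values resembling the COH/CON/FC/SS columns of Tables 3-6.
from collections import defaultdict
from statistics import mean

def metric_means(scores):
    """scores: iterable of (model, metric, score) tuples with score in {0, 1, 2}."""
    per_model = defaultdict(lambda: defaultdict(list))
    for model, metric, score in scores:
        per_model[model][metric].append(score)
    return {
        model: {metric: round(mean(vals), 2) for metric, vals in by_metric.items()}
        for model, by_metric in per_model.items()
    }

print(metric_means([("gpt-4o", "COH", 2), ("gpt-4o", "COH", 1),
                    ("gpt-4o", "FC", 1), ("gpt-4o", "FC", 0)]))
# {'gpt-4o': {'COH': 1.5, 'FC': 0.5}}
```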

“name": “Coherence",
“description": "Coherence assesses the logical flow of the response, ensuring that one idea leads smoothly to the next. A coherent response should
present information in a structured manner, making it easy for the reader to follow the thought process without confusion.",

"scoring": {
"0": {
"(a)": "The response is highly disorganized and lacks a clear structure, making it difficult to follow.",
"(b)": "Sentences or ideas appear out of order or are disconnected, resulting in a confusing or jarring reading experience.",
"(c)": "The overall message is unclear due to poor organization."
},
"1": {
"(a)": "The response has some structure but includes noticeable breaks in the logical flow.",
"(b)": "Transitions between ideas may be abrupt, or there may be gaps in the reasoning, forcing the reader to make extra effort
to follow along.",
"(c)": "While the main point is evident, the flow is inconsistent."
},
"2": {
"(a)": "The response is well-organized and flows logically from one idea to the next.",
"(b)": "Each point builds naturally on the previous one, creating a clear and cohesive narrative.",
"(c)": "The reader can easily follow the thought process without having to backtrack or piece together disjointed information."
}
}

Figure 5: Metric description: COHERENCE


“name": “Conciseness",
“description": "This metric evaluates how effectively the response conveys its message without unnecessary repetition or extraneous details. A
concise response is brief yet comprehensive, avoiding long-winded explanations and focusing on the core message. However, it must not sacrifice
clarity or completeness in the pursuit of brevity.",

"scoring": {
"0": {
"(a)": "The response is overly verbose, including repeated information, irrelevant details, or excessive explanations.",
"(b)": "It takes far longer than necessary to convey the intended message, making it inefficient and difficult to read."
},
"1": {
"(a)": "The response is somewhat concise but includes some unnecessary information or redundant points.",
"(b)": "While the main message is clear, the response could be made more efficient by removing repetition or streamlining explanations."
},
"2": {
"(a)": "The response is highly concise, delivering all relevant information in a brief and efficient manner.",
"(b)": "There is no repetition, and every sentence serves a clear purpose.",
"(c)": "The message is conveyed succinctly, without sacrificing clarity or detail."
}
}

Figure 6: Metric description: CONCISENESS

“name": “Factual Accuracy",


“description": "This metric assesses the factual correctness of the response, focusing on whether the information provided aligns with verified
facts from the ground-truth answer and the available knowledge base. It evaluates both numerical and phrase-based facts, ensuring that key factual
elements such as data points, dates, and specific terminology are accurate and verifiable. The evaluation emphasizes the accuracy of important
details that are crucial for the validity of the response.",

"scoring": {
"0": {
"(a)": "The response contains one or more significant factual errors.",
"(b)": "Key facts, numbers, or data points are incorrect, misleading, or fabricated, and the response does not align with the ground-truth
or the knowledge base.",
"(c)": "The factual inaccuracies could lead to misunderstandings or incorrect conclusions."
},
"1": {
"(a)": "The response is partially accurate but contains minor factual inaccuracies or omissions.",
"(b)": "While the majority of facts are correct, some important details may be misstated or missing.",
"(c)": "The response captures the general truth but lacks precision or completeness in key factual areas."
},
"2": {
"(a)": "The response is factually accurate, with all critical facts, figures, and details aligned with the ground-truth answer and
knowledge base.",
"(b)": "There are no factual errors, and the information is presented with precision and correctness, making the response highly
reliable."
}
}

Figure 7: Metric description: FACTUAL CORRECTNESS

“name": “Semantic Similarity",


“description": "This metric assesses the core meaning and factual alignment between the prediction and ground-truth. It evaluates whether critical
information such as factual knowledge, numbers, and key phrases match, prioritizing factual accuracy and the alignment of essential concepts over
stylistic or surface-level similarities.",

"scoring": {
"0": {
"(a)": "he prediction does not align with the ground truth in terms of key facts, numbers, or critical phrases.",
"(b)": "The core meaning of the prediction diverges entirely from the ground-truth.",
"(c)": "The differences would lead to misunderstandings or incorrect conclusions about the core message."
},
"1": {
"(a)": "The prediction contains some similarities to the ground truth, with some key facts, numbers, and phrases being correctly aligned.",
"(b)": "However, the prediction is missing some information or contains some added information.",
"(c)": "This causes the prediction to fail at encapsulating the entire core meaning present in the ground truth."
},
"2": {
"(a)": "The prediction is semantically similar to the ground-truth, with key facts, numbers, and phrases correctly aligned.",
"(b)": "Any differences are minor and do not significantly alter the core meaning or factual accuracy.",
"(c)": "The essential message of the prediction matches that of the ground-truth."
}
}

Figure 8: Metric description: SEMANTIC SIMILARITY
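For illustration, the sketch below expresses one rubric (Coherence, Figure 5) as the JSON object that the Figure 4 evaluator prompt expects in its metric field. The rubric text is abbreviated and the (a)/(b)/(c) sub-criteria are flattened; the exact serialization used in the study is an assumption.

```python
# Illustrative only: a metric rubric as the JSON string handed to the evaluator.
import json

coherence_metric = {
    "name": "Coherence",
    "description": "Coherence assesses the logical flow of the response, "
                   "ensuring that one idea leads smoothly to the next. [...]",
    "scoring": {
        "0": "The response is highly disorganized and lacks a clear structure [...]",
        "1": "The response has some structure but includes noticeable breaks in the logical flow [...]",
        "2": "The response is well-organized and flows logically from one idea to the next [...]",
    },
}

metric_json = json.dumps(coherence_metric, indent=2)  # string passed in the "metric" field
print(metric_json)
```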


Language: English
  Human ranking: Phi-3.5-MoE-instruct (1.30), Mistral-Large-Instruct-2407 (1.28), GPT-4o (0.90), Meta-Llama-3.1-70B-Instruct (0.88), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.62)
  LLM ranking:   Phi-3.5-MoE-instruct (1.22), Mistral-Large-Instruct-2407 (1.14), GPT-4o (0.87), Meta-Llama-3.1-70B-Instruct (0.61), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.41)

Language: Hindi
  Human ranking: Mistral-Large-Instruct-2407 (1.21), Phi-3.5-MoE-instruct (0.95), GPT-4o (0.68), Meta-Llama-3.1-70B-Instruct (0.58), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.53)
  LLM ranking:   Mistral-Large-Instruct-2407 (1.16), Phi-3.5-MoE-instruct (0.79), GPT-4o (0.74), Meta-Llama-3.1-70B-Instruct (0.47), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.32)

Language: Kannada
  Human ranking: Mistral-Large-Instruct-2407 (0.96), GPT-4o (0.91), Meta-Llama-3.1-70B-Instruct (0.74), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.35), Phi-3.5-MoE-instruct (0.17)
  LLM ranking:   Mistral-Large-Instruct-2407 (0.96), GPT-4o (0.74), Meta-Llama-3.1-70B-Instruct (0.48), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.22), Phi-3.5-MoE-instruct (0.09)

Language: Tamil
  Human ranking: Mistral-Large-Instruct-2407 (1.37), GPT-4o (1.07), Phi-3.5-MoE-instruct (1.04), Meta-Llama-3.1-70B-Instruct (0.48), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.19)
  LLM ranking:   Mistral-Large-Instruct-2407 (1.26), GPT-4o (1.04), Phi-3.5-MoE-instruct (0.96), Meta-Llama-3.1-70B-Instruct (0.48), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.19)

Language: Telugu
  Human ranking: Mistral-Large-Instruct-2407 (1.31), Meta-Llama-3.1-70B-Instruct (0.62), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.38), GPT-4o (0.38), Phi-3.5-MoE-instruct (0.15)
  LLM ranking:   Mistral-Large-Instruct-2407 (0.77), Meta-Llama-3.1-70B-Instruct (0.46), GPT-4o (0.31), Phi-3.5-MoE-instruct (0.15), Indic-gemma-7b-finetuned-sft-Navarasa-2.0 (0.08)

Table 7: Human and LLM rankings according to the direct assessment. The value in brackets denotes the model's average SEMANTIC SIMILARITY score, which was used for the evaluation.
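For illustration, the sketch below computes the per-model percentage agreement plotted in Figure 2, under the assumption that agreement means an exact match between the 0–2 scores assigned by the human and the LLM evaluator; the study's precise agreement definition may differ.

```python
# Illustrative only: exact-match percentage agreement between human and
# LLM-evaluator scores for one model and one language.
def percentage_agreement(human_scores, llm_scores):
    """Both lists hold 0/1/2 scores for the same queries, in the same order."""
    assert human_scores and len(human_scores) == len(llm_scores)
    matches = sum(h == l for h, l in zip(human_scores, llm_scores))
    return matches / len(human_scores)

# 4 of 5 queries receive the same score from both evaluators -> 0.8
print(percentage_agreement([2, 1, 0, 2, 1], [2, 1, 1, 2, 1]))
```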
