SYNTHETIC CONTEXT GENERATION FOR QUESTION GENERATION

Naiming Liu (Rice University)   Zichao Wang (Adobe)   Richard Baraniuk (Rice University)

arXiv:2406.13188v1 [cs.CL] 19 Jun 2024

ABSTRACT
Despite rapid advancements in large language models (LLMs), question generation (QG) remains a challenging problem due
to its complicated process, open-ended nature, and the diverse settings in which question generation
occurs. A common approach to address these challenges involves fine-tuning smaller, custom models
using datasets containing background context, question, and answer. However, obtaining suitable
domain-specific datasets with appropriate contexts is often more difficult than acquiring question-
answer pairs. In this paper, we investigate training QG models using synthetic contexts generated by
LLMs from readily available question-answer pairs. We conduct a comprehensive study to answer
critical research questions related to the performance of models trained on synthetic contexts and
their potential impact on QG research and applications. Our empirical results reveal: 1) contexts are
essential for QG tasks, even if they are synthetic; 2) fine-tuning smaller language models can achieve
better performance than prompting larger language models; and 3) synthetic contexts and real contexts
can achieve comparable performance. These findings highlight the effectiveness of synthetic contexts
in QG and pave the way for future advancements in the field.

Keywords: Question Generation · Synthetic Data Generation

1 Introduction

Recent years have witnessed an increasing presence of automatic question generation (QG) in various natural language
processing applications. For example, many proposed techniques for large language models (LLMs) leverage the idea
of QG to unlock LLMs’ reasoning capabilities and mitigate hallucination [32, 28]. In personalized education, QG has
the potential to enable custom learning experiences in subjects such as reading comprehension [34, 10, 36, 11] and to
reduce the costs and lengths of standard assessment tests [1].
Despite the rapid progress in language technologies such as LLMs, QG remains a challenging problem. This
is due to several factors: 1) LLMs are mostly designed to answer questions rather than ask them [32], making it
challenging to design appropriate prompts for QG; 2) question generation is mostly an open-ended process, requiring
creativity and deep domain knowledge [24, 14, 7]; and 3) question generation involves diverse settings, each with its own
challenges [19, 13]. For these reasons, QG remains an active area of research.
A popular alternative to using LLMs for QG is to fine-tune a smaller, custom model, which can often achieve superior
performance on specific tasks compared to LLMs [8]. However, a challenge in fine-tuning models for QG is the availability
of training data, usually in the form of {question, answer, context} triplets [31]. Whereas question-answer pairs are
relatively easy to obtain, the background text (context) associated with each question-answer pair is not. Some existing
datasets contain all three elements (contexts, questions, and answers) [21], but a model trained on such a dataset often
does not adapt well to a specific QG domain such as math word problems [30] or fairytales [35]. Obtaining a
domain-specific dataset is therefore critical, but the contexts are often either difficult to find or unavailable (e.g., copyrighted
or behind a paywall).

Figure 1: Detailed Overview of Context Generation for Question Generation. We first prompt LLMs to generate
synthetic context, then use the generated context and answer to fine-tune smaller LMs for question generation.

1.1 Contributions
In this paper, we study the problem of training a QG model from synthetic contexts, assuming data in the form of only
question and answer pairs, without contexts, are readily available. The idea is to generate synthetic contexts, using
LLMs, from the question-answer pairs, and then use the generated contexts, together with the question-answer pairs, as
training data to train a QG model.
Our idea is motivated by two recent successes and research trends. First, studies have demonstrated that today’s LLMs
can already generate highly diverse, creative, and human-like content [18, 16]. We posit that they can also generate
plausible background texts similar to the real ones. Second, recent research demonstrates the feasibility of training
high-performing, albeit smaller, models with synthetic data generated by LLMs [29, 25, 3]. These results make it
appealing to use LLMs to synthesize contexts and use them for training QG models.
We conduct a scientific investigation into the feasibility of generating synthetic contexts for training QG models.
Our empirical findings demonstrate the following: 1) QG tasks heavily rely on contexts, whether they are synthetic or
not; 2) Fine-tuning smaller language models can lead to superior performance compared to larger language models
when prompted; and 3) Synthetic contexts and real contexts can achieve similar levels of performance. These results
emphasize the efficacy of synthetic contexts in QG and open avenues for future advancements in this field.

2 Approach
Our approach, context generation for question generation, leverages existing LLMs to create synthetic context
for QG. Our method comprises two components: (1) Synthetic Context Generation, in which we employ LLMs
to produce contextually relevant samples based on given question-answer pairs, and (2) Context-Based Question
Generation, which uses the synthetic context generated in the previous step, along with real answers, to carry out
question generation tasks. A diagram of our approach can be found in Figure 1.

2.1 Synthetic Context Generation


Consider a set of question-answer pairs {(q_1, a_1), ..., (q_n, a_n)}. We first prompt an LLM to generate a relevant context
c_i that helps answer the question q_i with the given answer a_i. We formulate our prompt as
Your job is to write a {style} paragraph that significantly expands the given question {q_i} and
answer {a_i}.
where {style} denotes the particular style of context to be generated (for instance, for the SQuAD dataset, it
would be “wikipedia-style”). Similarly, {q_i} and {a_i} are placeholders that are replaced with the question and
answer, respectively. For each (q_i, a_i) pair, the LLM outputs a context c_i that contains relevant background information
to support answering the question.
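To make this step concrete, the following is a minimal sketch (our illustration, not code released with the paper) of generating a synthetic context for each question-answer pair with the OpenAI Python client, using the prompt template above and the sampling settings described in Section 3:

```python
# Hypothetical sketch of the synthetic-context step: prompt an LLM with each
# (question, answer) pair and collect the generated context c_i.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Your job is to write a {style} paragraph that significantly expands "
          "the given question {q} and answer {a}.")

def generate_context(question: str, answer: str, style: str = "wikipedia-style") -> str:
    """Generate one synthetic context c_i for a (q_i, a_i) pair."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": PROMPT.format(style=style, q=question, a=answer)}],
        temperature=0.9,   # temperature used in the experiments (Section 3)
        top_p=1.0,         # nucleus sampling with p = 1
    )
    return response.choices[0].message.content

qa_pairs = [("What method does the photovoltaics system use to turn light into "
             "electricity?", "photoelectric effect")]
contexts = [generate_context(q, a) for q, a in qa_pairs]
```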


Model                                   Bleu-4   Meteor   Rouge-L   Bleurt
Flan-T5-large (w/o synthetic context)   0.035    0.106    0.136     0.227
Flan-T5-large                           0.191    0.397    0.383     0.528
davinci-003 (zero-shot)                 0.079    0.262    0.212     0.419
davinci-003 (few-shot)                  0.116    0.309    0.269     0.507
GPT-3.5 (zero-shot)                     0.107    0.352    0.256     0.488
GPT-3.5 (few-shot)                      0.147    0.370    0.306     0.495

Table 1: Performance of question generation on the OS-Bio dataset, where the first row denotes QG without context and
the rest are QG with synthetic context. Notably, fine-tuning a smaller language model outperforms prompting larger
language models.

Context Type       Bleu-4   Meteor   Rouge-L   Bleurt
real               0.132    0.337    0.356     0.457
synthetic (zero)   0.143    0.312    0.333     0.433
synthetic (few)    0.151    0.324    0.347     0.443
real (all)         0.155    0.338    0.352     0.459

Table 2: QG performance on the SQuAD dataset trained with real vs. synthetic context generated with zero-shot and
few-shot prompting. The first three rows use 1,000 samples as the training set, while the last row uses all 87,599 data
points, serving as an upper bound. Remarkably, 1,000 synthetic contexts can already yield QG performance comparable
to using all real contexts as the training set.

2.2 Context-based Question Generation

We adopt a language model (LM) fine-tuning approach for our context-based question generation. After obtaining
the contexts from the previous step, we construct a dataset of triplets {(q_1, c_1, a_1), ..., (q_n, c_n, a_n)}. Our goal is to
generate the ground-truth question q_i conditioned on the input context-answer pair (c_i, a_i) using the negative log-
likelihood objective. The input to the LM uses the following template: Based on the context c_i and answer
a_i, generate a {style} question. Training follows the standard language modeling objective, where the
model takes the context and answer as input and learns to predict the question.
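The following is a minimal sketch, assuming the Hugging Face Transformers API (the paper does not specify its implementation), of how one training example can be serialized into this input-output format for Flan-T5; the negative log-likelihood of the gold question is obtained from the standard seq2seq forward pass:

```python
# Sketch of the {context, answer} -> question formatting for Flan-T5 fine-tuning.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def encode_example(context: str, answer: str, question: str, style: str = "wikipedia-style"):
    # Serialize the input using the template from Section 2.2; the gold question is the target.
    source = (f"Based on the context {context} and answer {answer}, "
              f"generate a {style} question.")
    inputs = tokenizer(source, truncation=True, max_length=1024, return_tensors="pt")
    labels = tokenizer(question, truncation=True, max_length=128, return_tensors="pt").input_ids
    inputs["labels"] = labels
    return inputs

batch = encode_example(
    context="Solar power is the conversion of sunlight into electricity ...",
    answer="photoelectric effect",
    question="What method does the photovoltaics system use to turn light into electricity?",
)
loss = model(**batch).loss  # negative log-likelihood of the gold question
```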

3 Experiments

Dataset We conducted experiments on two datasets: the OpenStax Biology-2e textbook (OS-Bio) [6] and SQuAD, the
Stanford Question Answering Dataset [22]. The OS-Bio dataset contains the review questions at the end of each chapter
of an introductory college-level biology textbook published by OpenStax, which we curated ourselves and will
open-source. All results are reported on the test split.

Experimental Setup For synthetic context generation, we use the GPT-3.5 model from OpenAI’s API [17] with
nucleus sampling [9] (p = 1) and a temperature of 0.9. We adopt both zero-shot and few-shot in-context learning
strategies, using two pre-selected examples to guide the LLM during generation. For question generation, we fine-tune
the pre-trained Flan-T5-large model [5] for 10 epochs with early stopping. During training, we use synthetic context;
during testing, we use real context (if available) in order to examine the QG model’s generalizability and performance
in real-world scenarios. Additionally, for QG on the SQuAD dataset, we randomly select 1,000, 5,000, and 10,000 data points
to generate synthetic context as our training data and then evaluate on the whole test set with real context.1 More
experimental details can be found in Appendix A.

Evaluation We use four evaluation metrics: BLEU-4, METEOR, ROUGE-L, and BLEURT, all of
which are widely used in existing QG work.
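As an illustration of the evaluation step, the sketch below computes three of the four metrics with the Hugging Face `evaluate` library (an assumption on our part; the paper does not name its implementation). BLEURT requires an additional checkpoint download and is omitted here:

```python
# Sketch of the automatic evaluation for generated questions.
import evaluate

bleu = evaluate.load("bleu")      # n-gram precision up to 4-grams (BLEU-4)
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")    # "rougeL" corresponds to ROUGE-L

preds = ["What does meiosis usually produce?"]          # model outputs
refs = ["What does meiosis usually produce four of?"]   # gold questions

print(bleu.compute(predictions=preds, references=[[r] for r in refs], max_order=4)["bleu"])
print(meteor.compute(predictions=preds, references=refs)["meteor"])
print(rouge.compute(predictions=preds, references=refs)["rougeL"])
```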

1 Due to the cost of the OpenAI API, we choose to generate and experiment with synthetic context for a fraction of the training set.


Figure 2: Word count and perplexity distribution for real and synthetic context generated with few-shot learning.

Model Type    BLEU-4   Meteor   Rouge-L   Bleurt
small (S)     0.106    0.250    0.282     0.371
medium (S)    0.124    0.296    0.325     0.426
large (S)     0.132    0.324    0.347     0.453
small (R)     0.100    0.241    0.267     0.359
medium (R)    0.131    0.291    0.317     0.414
large (R)     0.151    0.324    0.347     0.443

Table 3: Performance of QG on SQuAD for different model sizes. Small, medium, and large denote Flan-T5-small,
-medium, and -large, respectively. (S) means synthetic context generated using few-shot learning, while (R) denotes real
context.

Data Type     BLEU-4   Meteor   Rouge-L   Bleurt
small (S)     0.151    0.324    0.347     0.443
medium (S)    0.135    0.308    0.319     0.420
large (S)     0.109    0.295    0.305     0.407
small (R)     0.132    0.324    0.347     0.453
medium (R)    0.157    0.337    0.356     0.457
large (R)     0.149    0.328    0.351     0.456

Table 4: Performance of QG on SQuAD for different training-set sizes. Small, medium, and large denote 1,000, 5,000, and
10,000 data points, respectively. (S) means synthetic context generated using few-shot learning, while (R) denotes real
context.

3.1 Research Questions and Findings

We investigate six research questions (RQ) on the use of synthetic context for question generation, and subsequently
provide an in-depth analysis based on our experimental results.

RQ1: Is context necessary for question generation, even if the context is synthetic?
As demonstrated in Table 1, the performance of QG without context is significantly worse than QG with synthetic
context when fine-tuning the Flan-T5-large model. This outcome can be attributed to the fact that the presence of context,
regardless of its authenticity, offers vital information for generating meaningful and appropriate questions. Hence, the
inclusion of context proves to be crucial for the QG task, even if the provided context is synthetic.

RQ2: How do fine-tuned small language models compare to prompted LLMs on QG?
We also use the OS-Bio dataset to investigate the performance differences between fine-tuning smaller language models
and prompting larger language models. We employ the Flan-T5-large model with 780M parameters for fine-tuning, and
GPT-based models with 175B parameters for zero-shot and few-shot prompting. As illustrated in Table 1, when
synthetic context is available, fine-tuning smaller language models can achieve better performance than
prompting larger language models. Also, as anticipated, few-shot prompting improves model performance compared to
zero-shot prompting, especially when the context is present. These comparisons demonstrate the
superior performance of fine-tuned small models compared to LLMs on the QG task.

RQ3: How do questions generated from real context compare to those from synthetic context?
We examine the QG performance difference between models trained on real and synthetic contexts. We compare three
types of contexts: real contexts from the SQuAD dataset, and synthetic contexts generated using the GPT-3.5 API with
zero-shot and few-shot learning. Our results, as shown in Table 2, demonstrate that models trained on merely 1,000 synthetic contexts

Figure 3: Performance of QG as the fraction of synthetic context increases. We present the Meteor and Bleurt evaluation
metrics here (one panel per metric, with curves for flan-t5-small, flan-t5-base, and flan-t5-large).

can yield strikingly comparable QG performance to models trained on all real contexts. In RQ5, we further
analyze the effect of the number of synthetic training examples.

RQ4: What is the quality of the synthetic contexts?

We perform two assessments to examine the quality of the generated contexts. First, we assess the differences in word
count and perplexity (computed with the GPT-2 model [20]), with the results presented in Figure 2. While the word-count
distributions of real and synthetic contexts are relatively similar, the synthetic contexts generally exhibit lower perplexity
than the real contexts, suggesting that the synthetic contexts align better with the
language model’s distribution and are less diverse than real contexts.
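A minimal sketch of the perplexity measurement on a single context, assuming GPT-2 from Hugging Face Transformers as the scoring model:

```python
# Sketch of per-context perplexity under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean token-level NLL.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return float(torch.exp(loss))

print(perplexity("Meiosis is an important process of cell division ..."))
```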
Second, we conduct a question answering (QA) evaluation to explore whether the synthetic context contains the
information needed to answer the given question. We find that 84% of the synthetic contexts contain the answer
phrase. Using a QA model2, we also find that the synthetic context enables the QA model to answer the question correctly
61% of the time, with a 0.77 F1 score. We further manually examined 100 instances of synthetic context that did not
yield an exact match and found that only seven contexts contained information confusing for question answering.
These results indicate a reasonable level of information inclusion in the synthetic contexts,
suggesting their effectiveness for question generation.
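The QA-based check can be sketched as follows, using the SQuAD-tuned RoBERTa reader cited in the footnote; the exact-match criterion shown here is one simple way to score the prediction (the F1 computation is analogous):

```python
# Sketch of the QA evaluation: a SQuAD reader answers each question from its
# synthetic context, and the prediction is compared against the gold answer.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def exact_match(question: str, answer: str, synthetic_context: str) -> bool:
    pred = qa(question=question, context=synthetic_context)["answer"]
    return pred.strip().lower() == answer.strip().lower()

print(exact_match(
    question="Meiosis usually produces ____.",
    answer="four haploid daughter cells",
    synthetic_context="Meiosis ... results in the production of four haploid daughter cells ...",
))
```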

RQ5: What is the impact of model size and training data size for question generation?
Table 3 shows that an increase in model size leads to improved QG performance, regardless of the context
type. However, Table 4 shows that the effect of training dataset size on performance varies with the nature of the
context. For real context, increasing the training dataset size tends to result in improved QG performance. However, we
observe a decline in QG performance as the training dataset size increases for the synthetic dataset. We hypothesize
that more synthetic data introduces more noise and inconsistencies into the learning process, leading to an increased
mismatch between the training (synthetic contexts) and test (real contexts) data. A potential mitigation is
few-shot prompting for context generation, which achieves the second strongest result in Table 2 and could potentially
reverse the diminishing trend in QG performance as the synthetic training dataset size increases.

RQ6: What happens when real and synthetic contexts are mixed?
To further explore the effects of synthetic context for QG, we conducted an experiment that interpolates
between real and synthetic context generated with few-shot prompting on the subset of 10,000 SQuAD data points. We
aim to investigate how QG performance changes as the proportion of synthetic context in the training
dataset increases. The results in Figure 3 reveal a generally decreasing trend in performance, as anticipated, when the
proportion of synthetic context increases. Another intriguing observation is that when the training dataset contains only
synthetic context, performance declines substantially, which implies the value of incorporating
some real context within the training dataset. Even the inclusion of a minimal portion (approximately 10%) of real
context mixed with a predominantly synthetic dataset appears to boost QG performance. A possible explanation is
that integrating a small yet substantial portion of real context helps reduce the gap between the training and testing
contexts. Further research could explore the ideal proportion of real and synthetic context to not only ensure
model performance but also mitigate the effort of obtaining real context.

2 We use the Roberta-base model: https://huggingface.co/deepset/roberta-base-squad2
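For illustration, the mixing procedure in RQ6 for a given synthetic fraction can be sketched as below; the function and field names are hypothetical:

```python
# Hypothetical sketch of building the interpolated training sets for RQ6: a
# fraction `frac_synthetic` of the examples keeps its synthetic context and the
# remainder keeps the real SQuAD context.
import random

def mix_contexts(examples, frac_synthetic: float, seed: int = 0):
    """examples: list of dicts with 'question', 'answer', 'real_context', 'synthetic_context'."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * frac_synthetic)
    mixed = []
    for i, ex in enumerate(shuffled):
        context = ex["synthetic_context"] if i < cutoff else ex["real_context"]
        mixed.append({"question": ex["question"], "answer": ex["answer"], "context": context})
    return mixed

# Sweep the fraction from 0.0 (all real) to 1.0 (all synthetic) and fine-tune a
# QG model on each mixture, as in Figure 3.
```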

4 Related Work
Automatic question generation (QG) is a complex task that potentially 1) involves the generation of many different
types of questions, including factual [31], multi-hop [24], multiple-choice [19], and multi-document [4]; 2) spans
multiple domains such as literacy [35], math [30], science [33], language learning [1], and customer service [27]; and 3)
requires controllability in terms of difficulty [2] or multi-modality [15]. Furthermore, QG is generally considered a
more challenging task than question answering, since the former usually requires more creativity, intentionality, and
art [23, 12]. Because of the aforementioned complexity and challenges, large language models’ performance
on QG remains under-explored and unsatisfactory, and their use is commonly limited to simple factual QG or to serving as an
auxiliary task that facilitates another task such as QA, rather than being the objective itself. Fine-tuning a smaller, QG-specific model
is a promising approach to improve domain-specific performance, as a few recent works have demonstrated. Our present
work fits in the fine-tuning paradigm and investigates the feasibility of synthesizing the contexts as part of the QG
fine-tuning data, because the contexts are important to QG but can be challenging to obtain.
Our work is also related to recent lines of work demonstrating the promise of using LLM-generated synthetic data for
fine-tuning smaller models in settings such as instruction tuning [29, 25, 3]. Some works also show, empirically, that
synthetic data leads to stable and useful fine-tuning as long as the generated data is close to the real data [26]. These
prior works inspire our present work and motivate us to adapt the synthetic-data fine-tuning paradigm to the niche
direction of QG. Methodology-wise, our work shares many merits with prior works. However, we investigate a few
questions that are unique to QG, such as the necessity of contexts during fine-tuning and the quality of the synthesized
contexts.
Our investigation reveals insights and caveats of synthesizing training data specifically for QG, offering practical
suggestions and opening further research opportunities to study the various approaches to fine-tuning language models for
QG.

5 Conclusion
In this paper, we investigate the effects of using synthetic context generated by LLMs to train QG models. We
provide an in-depth analysis to address critical research questions associated with this approach. Our experimental results
demonstrate the importance of synthetic contexts for the LM-based question generation task, as they can achieve
performance comparable to real contexts. Furthermore, our findings reveal that, given the presence of related
context, irrespective of its origin, fine-tuning smaller language models can yield superior performance compared to
prompting larger language models.

Acknowledgments
This work was supported by NSF grants 1842378, ONR grant N0014-20-1-2534, AFOSR grant FA9550-22-1-0060,
and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047.

References
[1] J. Burstein, G. T. LaFlair, A. J. Kunnan, and A. A. von Davier. A theoretical assessment ecosystem for a digital-first
assessment—the duolingo english test. DRR-21-04, 2021.
[2] Y. Cheng, S. Li, B. Liu, R. Zhao, S. Li, C. Lin, and Y. Zheng. Guiding the growth: Difficulty-controllable question
generation through step-by-step rewriting. In Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume
1: Long Papers). Association for Computational Linguistics, 2021.
[3] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica,
and E. P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
[4] W. S. Cho, Y. Zhang, S. Rao, A. Celikyilmaz, C. Xiong, J. Gao, M. Wang, and B. Dolan. Contrastive multi-
document question generation. In Proceedings of the 16th Conference of the European Chapter of the Association
for Computational Linguistics: Main Volume. Association for Computational Linguistics, 2021.


[5] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al.
Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
[6] M. A. Clark, M. Douglas, and J. Choi. Biology 2e. OpenStax, Houston, Texas, 2018.
[7] H. Elsahar, C. Gravier, and F. Laforest. Zero-shot question generation from knowledge graphs for unseen predicates
and entity types. In Proceedings of the 2018 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 218–228, New
Orleans, Louisiana, June 2018. Association for Computational Linguistics.
[8] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. D. Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa,
O. Saarikivi, A. Salim, S. Shah, H. S. Behl, X. Wang, S. Bubeck, R. Eldan, A. T. Kalai, Y. T. Lee, and Y. Li.
Textbooks are all you need, 2023.
[9] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. arXiv preprint
arXiv:1904.09751, 2019.
[10] R. Kokku, S. Sundararajan, P. Dey, R. Sindhgatta, S. Nitta, and B. Sengupta. Augmenting classrooms with AI
for personalized education. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, Apr. 2018.
[11] D. Kulshreshtha, M. Shayan, R. Belfer, S. Reddy, I. V. Serban, and E. Kochmar. Few-shot question generation for
personalized feedback in intelligent tutoring systems, 2022.
[12] S. Le Baron Payne. The art of asking questions. Princeton Legacy Library. Princeton University Press, Princeton,
NJ, July 2014.
[13] C. Liang, X. Yang, N. Dave, D. Wham, B. Pursel, and C. L. Giles. Distractor generation for multiple choice
questions using learning to rank. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building
Educational Applications, pages 284–290, New Orleans, Louisiana, June 2018. Association for Computational
Linguistics.
[14] T. Liu, Q. Fang, W. Ding, H. Li, Z. Wu, and Z. Liu. Mathematical word problem generation from commonsense
knowledge graph and equations. In Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing, pages 4225–4240, Online and Punta Cana, Dominican Republic, Nov. 2021. Association
for Computational Linguistics.
[15] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain:
Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural
Information Processing Systems (NeurIPS), 2022.
[16] OpenAI. Gpt-4 technical report, 2023.
[17] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray,
et al. Training language models to follow instructions with human feedback. Advances in Neural Information
Processing Systems, 35:27730–27744, 2022.
[18] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray,
J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and
R. Lowe. Training language models to follow instructions with human feedback, 2022.
[19] Z. Qiu, X. Wu, and W. Fan. Automatic distractor generation for multiple choice questions in standard tests. In
Proceedings of the 28th International Conference on Computational Linguistics, pages 2096–2106, Barcelona,
Spain (Online), Dec. 2020. International Committee on Computational Linguistics.
[20] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9, 2019.
[21] P. Rajpurkar, R. Jia, and P. Liang. Know what you don’t know: Unanswerable questions for SQuAD. In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers), pages 784–789, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[22] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text.
arXiv preprint arXiv:1606.05250, 2016.
[23] E. H. Schein and P. A. Schein. Humble Inquiry, second edition: The gentle art of asking instead of telling.
Berrett-Koehler, Feb. 2021.
[24] D. Su, Y. Xu, W. Dai, Z. Ji, T. Yu, and P. Fung. Multi-hop question generation with graph convolutional network.
In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4636–4647, Online, Nov.
2020. Association for Computational Linguistics.


[25] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford alpaca:
An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[26] R. Taori and T. B. Hashimoto. Data feedback loops: Model-driven amplification of dataset biases, 2022.
[27] M. Wan and X. Chen. Beyond "how may i help you?": Assisting customer service agents with proactive responses,
2018.
[28] B. Wang, X. Deng, and H. Sun. Iteratively prompt pre-trained language models for chain of thought. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2714–2730,
Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics.
[29] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning
language models with self-generated instructions, 2022.
[30] Z. Wang, A. Lan, and R. Baraniuk. Math word problem generation with mathematical consistency and problem
context constraints. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
pages 5986–5999, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational
Linguistics.
[31] Z. Wang, A. S. Lan, W. Nie, A. E. Waters, P. J. Grimaldi, and R. G. Baraniuk. QG-net. In Proceedings of the Fifth
Annual ACM Conference on Learning at Scale. ACM, June 2018.
[32] J. Wei, X. Wang, D. Schuurmans, M. Bosma, brian ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain of
thought prompting elicits reasoning in large language models. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho,
editors, Advances in Neural Information Processing Systems, 2022.
[33] J. Welbl, N. F. Liu, and M. Gardner. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd
Workshop on Noisy User-generated Text. Association for Computational Linguistics, 2017.
[34] J. H. Wolfe. Automatic question generation from text - an aid to independent study. In Proceedings of the ACM
SIGCSE-SIGCUE technical symposium on Computer science and education -. ACM Press, 1976.
[35] Y. Xu, D. Wang, M. Yu, D. Ritchie, B. Yao, T. Wu, Z. Zhang, T. Li, N. Bradford, B. Sun, T. Hoang, Y. Sang,
Y. Hou, X. Ma, D. Yang, N. Peng, Z. Yu, and M. Warschauer. Fantastic questions and where to find them:
FairytaleQA – an authentic dataset for narrative comprehension. In Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pages 447–460, Dublin, Ireland, May 2022.
Association for Computational Linguistics.
[36] Z. Zhang, Y. Xu, Y. Wang, B. Yao, D. Ritchie, T. Wu, M. Yu, D. Wang, and T. J.-J. Li. StoryBuddy: A human-AI
collaborative chatbot for parent-child interactive storytelling with flexible parental involvement. In CHI Conference
on Human Factors in Computing Systems. ACM, Apr. 2022.

A Experimental Details
A.1 OS-bio Dataset

We provide several examples of question-answer pairs from the OS-bio dataset, along with their generated contexts, in Table 5.
For the OS-bio dataset, we attempted to manually extract relevant information from the OS-bio textbook to serve as
real context. However, upon closer examination, we found that only approximately 50% of the samples in the test
set had related contexts within the textbook. The remaining samples either could only be inferred from the textbook
information or were spread across an entire chapter without corresponding to a specific section. Moreover, the QG
performance using the extracted contexts was not satisfactory, which underscores the importance of generating synthetic
contexts with LLMs for the QG task.

A.2 Prompt design

In this section, we provide additional prompts for context and question generation when using the prompting strategy.
For context generation with the OS-bio dataset, the {style} placeholder is replaced by “an introductory
college level scientific paragraph about biology”, while for the SQuAD dataset, {style} is “wikipedia-style
paragraph”. Additionally, for few-shot learning with the SQuAD dataset, we add a special indicator, title, to ensure the
generated context is relevant. The prompt is changed to:
Your job is to write a wikipedia-style paragraph on a specific topic. Your written paragraph
should contains the answer to a question that asks about certain information related to the topic.
The user will first provide the topic, question, and answer and some example paragraphs.


Example 1
Question: Bacteria that lack fimbriae are less likely to ____.
Answer: adhere to cell surfaces
Synthetic Context: Bacteria possess a variety of extracellular appendages that allow them to attach to surfaces, including fimbriae. Fimbriae are thin, hair-like projections composed of proteins that allow bacteria to adhere to surfaces such as cell membranes in the body. Bacteria lacking fimbriae are less likely to adhere to cell surfaces and therefore less likely to cause infection. This is due to the fact that fimbriae act as receptors that allow bacteria to recognize and bind to certain molecules on the surface of a cell. Without these receptors, bacteria lack the ability to attach to cells and therefore cannot colonize or cause infection. Thus, a lack of fimbriae can make bacteria less likely to cause infection.

Example 2
Question: Meiosis usually produces ____.
Answer: four haploid daughter cells
Synthetic Context: Meiosis is an important process of cell division that is vital for the production of gametes in sexually-reproducing species. It is a unique type of cell division as it results in the production of four haploid daughter cells, each with only one set of chromosomes. This is in contrast to mitosis, which produces two daughter cells with the same genetic information as the parent cell. During meiosis, the chromosomes within the parent cell pair up and exchange genetic material with their partner, a process known as crossing over. This leads to an increase in genetic diversity which is beneficial for the survival of the species.

Example 3
Question: Population genetics is the study of:
Answer: how selective forces change the allele frequencies in a population over time
Synthetic Context: Population genetics is a field of study that investigates how selective forces, such as natural selection and genetic drift, affect the allele frequencies of a population over time. By examining the changes in allele frequencies of a population over multiple generations, population geneticists can gain insight into the evolutionary history of the population, such as which selective forces have been most influential in shaping its genotype. Additionally, population genetics can provide clues about the current health of the population and how it may be affected by environmental change.

Table 5: Examples of question-answer pairs in the OS-bio dataset, together with their synthetic contexts. Each synthetic context contains the portion corresponding to the provided answer.

Second, for prompting-based QG with the OS-bio dataset, we use the following prompt:
Based on the context below, generate an introductory college level biology question with {a} as the
answer.

A.3 Additional Experimental Setup

Regarding synthetic context generation with the SQuAD dataset, we employ both zero-shot and few-shot in-context
learning strategies, utilizing two pre-selected examples to guide LLMs. A few-shot learning example is shown below:
"title": "Solar_energy",
"context": "Solar power is the conversion of sunlight into electricity, either directly using
photovoltaics (PV), or indirectly using concentrated solar power (CSP). CSP systems use lenses
or mirrors and tracking systems to focus a large area of sunlight into a small beam. PV converts
light into electric current using the photoelectric effect.",
"question": "What method does the photovoltaics system use to turn light into electricity?",
"answer": "photoelectric effect"
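For illustration, the two pre-selected examples and a new (title, question, answer) triple might be assembled into a chat prompt as follows; the exact message layout is our assumption, not code from the paper:

```python
# Hypothetical assembly of the few-shot prompt for SQuAD context generation.
SYSTEM = ("Your job is to write a wikipedia-style paragraph on a specific topic. "
          "Your written paragraph should contain the answer to a question that asks "
          "about certain information related to the topic. The user will first provide "
          "the topic, question, and answer and some example paragraphs.")

def build_messages(examples, title, question, answer):
    """examples: list of dicts with 'title', 'question', 'answer', 'context'."""
    shots = "\n\n".join(
        f"Topic: {ex['title']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}\n"
        f"Paragraph: {ex['context']}"
        for ex in examples
    )
    user = f"{shots}\n\nTopic: {title}\nQuestion: {question}\nAnswer: {answer}\nParagraph:"
    return [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": user}]
```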
For the question generation task, we conduct all training on a single NVIDIA Quadro RTX 8000 GPU, applying a
consistent training setup across all QG tasks. Specifically, we adopt a learning rate of 0.0003, train the model for 10
epochs, and apply early stopping if the validation loss fails to improve over the most recent 3 epochs. Additionally,
we use a batch size of 8. All of these training procedures are standard for fine-tuning question generation models.
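The configuration above can be expressed, for example, with the Hugging Face Trainer API (our sketch; the authors do not specify their training framework):

```python
# Sketch of the fine-tuning configuration described above. `model`, `tokenizer`,
# `train_dataset`, and `val_dataset` are placeholders for Flan-T5-large and the
# tokenized {context, answer} -> question examples from Section 2.2.
from transformers import (DataCollatorForSeq2Seq, EarlyStoppingCallback,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-large-qg",        # hypothetical output directory
    learning_rate=3e-4,                   # 0.0003, as stated above
    num_train_epochs=10,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # needed for early stopping on validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # 3 stagnant epochs
)
trainer.train()
```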
