In recent years, LLMs have achieved state-of-the-art performance across many NLP tasks, in turn becoming the de facto baseline models in many experimental settings (Mars, 2022). There is, however, evidence that the power of LLMs can also be leveraged for malicious purposes, including their use to cheat on school assignments (Cotton et al., 2023) or to generate content that is offensive or spreads misinformation (Weidinger et al., 2022).

The strong performance of LLMs has also inevitably provoked fears in society that artificial intelligence tools may eventually take over many people's jobs (George et al., 2023), raising questions about their ethical implications for society. This has in turn sparked research, with recent studies suggesting that AI tools should be embraced, as they can support and boost the performance of human labor rather than replace it (Noy and Zhang, 2023).

2.2 Risk of data contamination

Data contamination occurs when "downstream test sets find their way into the pretrain corpus" (Magar and Schwartz, 2022). When an LLM trained on large collections of text has already seen the data it is later given at test time for evaluation, the model will exhibit an impressive yet unrealistic performance score. Research has in fact shown that data contamination can be frequent and have a significant impact (Deng et al., 2023; Golchin and Surdeanu, 2023). It is therefore crucial that researchers ensure that the test data has not been seen by an LLM before, to guarantee a fair and realistic evaluation. This is, however, challenging, if not nearly impossible, to verify with black-box models, which again encourages the use of open-source, transparent LLMs.
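As an illustration of what such a check can look like when the pretraining corpus is accessible, the sketch below flags evaluation examples whose word n-grams overlap heavily with that corpus. It is an assumed, simplified setup: the file names, the n-gram size, and the 0.8 threshold are illustrative choices rather than taken from the cited works, which use more sophisticated detection techniques (e.g., Deng et al., 2023; Golchin and Surdeanu, 2023).

```python
# Minimal sketch (illustrative, not from the cited works): flag possible test-set
# contamination by measuring word n-gram overlap between evaluation examples and
# an accessible pretraining corpus. File names and threshold are assumptions.
from pathlib import Path


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_score(example: str, corpus_ngrams: set[tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the example's n-grams that also appear in the pretraining corpus."""
    example_ngrams = ngrams(example, n)
    if not example_ngrams:
        return 0.0
    return len(example_ngrams & corpus_ngrams) / len(example_ngrams)


if __name__ == "__main__":
    # Hypothetical layout: one document per line in the corpus dump,
    # one evaluation example per line in the test file.
    corpus_ngrams: set[tuple[str, ...]] = set()
    for document in Path("pretrain_corpus.txt").read_text(encoding="utf-8").splitlines():
        corpus_ngrams |= ngrams(document)

    for example in Path("test_set.txt").read_text(encoding="utf-8").splitlines():
        score = contamination_score(example, corpus_ngrams)
        if score > 0.8:  # illustrative threshold; tune per benchmark
            print(f"Possible contamination ({score:.0%} overlap): {example[:60]}...")
```

Such surface-level overlap checks presuppose access to the pretraining data, which is precisely what closed, black-box LLMs do not provide.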
2.3 Bias in LLM models
Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Alkaissi, H., and McFarlane, S. I. (2023). Artificial hallucinations in chatgpt: implications in scientific writing. Cureus 15, 2. doi: 10.7759/cureus.35179
Bakker, M., Chadwick, M., Sheahan, H., Tessler, M., Campbell-Gillingham, L., Balaguer, J., et al. (2022). Fine-tuning language models to find agreement among humans with diverse preferences. Adv. Neural Inform. Proc. Syst. 35, 38176–38189.
Belz, A., Agarwal, S., Shimorina, A., and Reiter, E. (2021). "A systematic review of reproducibility research in natural language processing," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Kerrville, TX: Association for Computational Linguistics, 381–393.
Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., et al. (2023). A survey on evaluation of large language models. arXiv. doi: 10.48550/arXiv.2307.03109
Chen, H., Jiao, F., Li, X., Qin, C., Ravaut, M., Zhao, R., et al. (2023). Chatgpt's one-year anniversary: are open-source large language models catching up? arXiv. doi: 10.48550/arXiv.2311.16989
Cinelli, M., Pelicon, A., Mozetič, I., Quattrociocchi, W., Novak, P. K., and Zollo, F. (2021). Dynamics of online hate and misinformation. Scient. Rep. 11, 22083. doi: 10.1038/s41598-021-01487-w
Cotton, D. R., Cotton, P. A., and Shipway, J. R. (2023). "Chatting and cheating: Ensuring academic integrity in the era of chatgpt," in Innovations in Education and Teaching International (Oxfordshire: Routledge), 1–12.
Danilevsky, M., Qian, K., Aharonov, R., Katsis, Y., Kawas, B., and Sen, P. (2020). "A survey of the state of explainable ai for natural language processing," in Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (Association for Computational Linguistics), 447–459.
Deng, C., Zhao, Y., Tang, X., Gerstein, M., and Cohan, A. (2023). Investigating data contamination in modern benchmarks for large language models. arXiv. doi: 10.48550/arXiv.2311.09783
Derczynski, L., Bontcheva, K., Lukasik, M., Declerck, T., Scharl, A., Georgiev, G., et al. (2014). "Pheme: computing veracity: the fourth challenge of big social data," in Proceedings of ESWC EU Project Networking (Vienna: Semantic Technology Institute International).
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). "Bert: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Kerrville, TX: Association for Computational Linguistics, 4171–4186.
Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., et al. (2019). "Unified language model pre-training for natural language understanding and generation," in Advances in Neural Information Processing Systems (Red Hook, NY: Curran Associates, Inc.), 32.
Gallegos, I. O., Rossi, R. A., Barrow, J., Tanjim, M. M., Kim, S., Dernoncourt, F., et al. (2023). Bias and fairness in large language models: a survey. arXiv. doi: 10.48550/arXiv.2309.00770
George, A. S., George, A. H., and Martin, A. G. (2023). Chatgpt and the future of work: a comprehensive analysis of ai's impact on jobs and employment. Partners Universal Int. Innovat. J. 1, 154–186.
Golchin, S., and Surdeanu, M. (2023). Time travel in llms: tracing data contamination in large language models. arXiv. doi: 10.48550/arXiv.2308.08493
Guo, S., Xie, C., Li, J., Lyu, L., and Zhang, T. (2022). Threats to pre-trained language models: Survey and taxonomy. arXiv. doi: 10.48550/arXiv.2202.06862
Gurrapu, S., Kulkarni, A., Huang, L., Lourentzou, I., and Batarseh, F. A. (2023). Rationalization for explainable nlp: a survey. Front. Artif. Intellig. 6, 1225093. doi: 10.3389/frai.2023.1225093
Kotek, H., Dockum, R., and Sun, D. (2023). "Gender bias and stereotypes in large language models," in Proceedings of The ACM Collective Intelligence Conference (New York, NY: Association for Computing Machinery), 12–24.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inform. Proc. Syst. 33, 9459–9474.
Li, M., Zhang, Y., Li, Z., Chen, J., Chen, L., Cheng, N., et al. (2023a). From quantity to quality: boosting llm performance with self-guided data selection for instruction tuning. arXiv. doi: 10.48550/arXiv.2308.12032
Li, Y., Du, M., Song, R., Wang, X., and Wang, Y. (2023b). A survey on fairness in large language models. arXiv. doi: 10.48550/arXiv.2308.10149
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2023). Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35. doi: 10.1145/3560815
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. (2019). Roberta: a robustly optimized bert pretraining approach. arXiv. doi: 10.48550/arXiv.1907.11692
Magar, I., and Schwartz, R. (2022). "Data contamination: From memorization to exploitation," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Kerrville, TX: Association for Computational Linguistics, 157–165.
Mars, M. (2022). From word embeddings to pre-trained language models: a state-of-the-art walkthrough. Appl. Sci. 12, 8805. doi: 10.3390/app12178805
Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. (2020). "On faithfulness and factuality in abstractive summarization," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Kerrville, TX: Association for Computational Linguistics, 1906.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv. doi: 10.48550/arXiv.1301.3781
Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen, T. H., Sainz, O., et al. (2023). Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput. Surv. 56, 1–40. doi: 10.1145/3605943
Mitchell, M., and Krakauer, D. C. (2023). The debate over understanding in AI's large language models. Proc. National Acad. Sci. 120, e2215907120. doi: 10.1073/pnas.2215907120
Navigli, R., Conia, S., and Ross, B. (2023). Biases in large language models: origins, inventory and discussion. ACM J. Data Inform. Qual. 15, 1–21. doi: 10.1145/3597307
Noy, S., and Zhang, W. (2023). Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence. Amsterdam: Elsevier Inc.
Pan, X., Zhang, M., Ji, S., and Yang, M. (2020). "Privacy risks of general-purpose language models," in 2020 IEEE Symposium on Security and Privacy (SP). San Francisco, CA: IEEE, 1314–1331.
Pennington, J., Socher, R., and Manning, C. D. (2014). "Glove: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stanford, CA: Stanford University, 1532–1543.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 5485–5551.
Rawte, V., Chakraborty, S., Pathak, A., Sarkar, A., Tonmoy, S., Chadha, A., et al. (2023). The troubling emergence of hallucination in large language models - an extensive definition, quantification, and prescriptive remediations. arXiv. doi: 10.18653/v1/2023.emnlp-main.155
Rigaki, M., and Garcia, S. (2023). A survey of privacy attacks in machine learning. ACM Comput. Surv. 56, 1–34. doi: 10.1145/3624010
Rosenfeld, R. (2000). Two decades of statistical language modeling: where do we go from here? Proc. IEEE 88, 1270–1278. doi: 10.1109/5.880083
Sarsa, S., Denny, P., Hellas, A., and Leinonen, J. (2022). "Automatic generation of programming exercises and code explanations using large language models," in Proceedings of the 2022 ACM Conference on International Computing Education Research (New York, NY: Association for Computing Machinery), 27–43. doi: 10.1145/3501385.3543957
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., et al. (2023). Bloom: A 176b-parameter open-access multilingual language model. arXiv. doi: 10.48550/arXiv.2211.05100
Schick, T., and Schütze, H. (2021). "Exploiting cloze-questions for few-shot text classification and natural language inference," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (Association for Computational Linguistics), 255–269.
Shayegani, E., Mamun, M. A. A., Fu, Y., Zaree, P., Dong, Y., and Abu-Ghazaleh, N. (2023). Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv. doi: 10.48550/arXiv.2310.10844
Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., et al. (2023). Detecting pretraining data from large language models. arXiv. doi: 10.48550/arXiv.2310.16789
Srivastava, A., Ahuja, R., and Mukku, R. (2023). No offense taken: eliciting offensiveness from language models. arXiv. doi: 10.48550/arXiv.2310.00892
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., et al. (2023). Stanford Alpaca: An Instruction-Following Llama Model. Available online at: https://github.com/tatsu-lab/stanford_alpaca (accessed December 1, 2023).
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., et al. (2023). Llama: Open and efficient foundation language models. arXiv. doi: 10.48550/arXiv.2302.13971
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). "Attention is all you need," in Advances in Neural Information Processing Systems (Red Hook, NY: Curran Associates, Inc.), 30.
Wan, Y., Pu, G., Sun, J., Garimella, A., Chang, K.-W., and Peng, N. (2023). "Kelly is a warm person, Joseph is a role model": Gender biases in llm-generated reference letters. arXiv. doi: 10.18653/v1/2023.findings-emnlp.243
Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., et al. (2021). Ethical and social risks of harm from language models. arXiv. doi: 10.48550/arXiv.2112.04359
Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P.-S., Mellor, J., et al. (2022). "Taxonomy of risks posed by language models," in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. New York, NY: Association for Computing Machinery, 214–229. doi: 10.1145/3531146.3533088
Xu, Y., Su, H., Xing, C., Mi, B., Liu, Q., Shi, W., et al. (2023). Lemur: Harmonizing natural language and code for language agents. arXiv. doi: 10.48550/arXiv.2310.06830
Yin, W., and Zubiaga, A. (2021). Towards generalisable hate speech detection: a review on obstacles and solutions. PeerJ Comp. Sci. 7, e598. doi: 10.7717/peerj-cs.598
Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., et al. (2023). Siren's song in the ai ocean: a survey on hallucination in large language models. arXiv. doi: 10.48550/arXiv.2309.01219
Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., et al. (2023). Explainability for large language models: a survey. arXiv. doi: 10.1145/3639372