Emerging trends: When can users trust GPT, and when should they intervene? (2024)
doi:10.1017/S1351324923000578
EMERGING TRENDS
Abstract
Usage of large language models and chat bots will almost surely continue to grow, since they are so easy
to use, and so (incredibly) credible. I would be more comfortable with this reality if we encouraged more
evaluations with humans-in-the-loop to come up with a better characterization of when the machine can
be trusted and when humans should intervene. This article will describe a homework assignment, where
I asked my students to use tools such as chat bots and web search to write a number of essays. Even
after considerable discussion in class on hallucinations, many of the essays were full of misinformation
that should have been fact-checked. Apparently, it is easier to believe ChatGPT than to be skeptical. Fact-
checking and web search are too much trouble.
1. Introduction
Much has been written about GPT, including Dale (2021), the most read paper in this journal.a
ChatGPT is extremely popular.
a https://www.cambridge.org/core/journals/natural-language-engineering/most-read
b https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
Users seem to prefer ChatGPT, perhaps because it is the new hot thing, or perhaps because bots are easier
use (and require fewer clicks than web search).
I was surprised how well ChatGPT did on metaphors, but I was also surprised how poorly it
did on references. If you ask ChatGPT for references on some topic, it will make up papers that
do not stand up to fact-checking. We will discuss some of the amazingly good cases in Section 5
and some of the amazingly bad cases in Section 6.
1. explanation,
2. capturing relevant generalizations (Chomsky 1957, 1965), and
3. model size (number of parameters) (Minsky and Papert 1969).
4.4 Human-in-the-loop
Much of the adversarial literature above is focused on use cases with no human in the loop.
In Church and Yue (2023), we envisioned a human-in-the-loop collaboration of humans with
machines. To make this work, we need to characterize which subtasks are appropriate for humans
and which are appropriate for machines. Evaluations in our field tend to focus on how well the
machines are doing by themselves, as opposed to which tasks are appropriate for humans and
which are appropriate for machines.
The adversarial literature above is asking the right questions, but we were hoping for answers that would be more helpful in human-in-the-loop settings. For example, one of the answers in Wang et al. (2023) is
The absolute performance of ChatGPT on adversarial and OOD classification tasks is still far
from perfection even if it outperforms most of the counterparts.
This answer is not that useful in human-in-the-loop settings. We already know that the machine
is more fluent than trustworthy (on average) (Church and Yue 2023). What we would like to know
is: when can we trust the machine, and when should we ask the human to step in?
5. Amazingly good
5.1 Metaphor
I expected the following sports metaphors to be hard, especially since most of the students in the
class are not native speakers of English. But I was surprised how well ChatGPT did on these.
I asked the students to use chat bots and search engines to explain what the following terms
mean. Which sports are these terms from? What do they mean in that sport? How are they used
metaphorically outside of that sport? If the term is more common in one English speaking country
than another, can you name a country that is likely to use the term?
5.2 Documentation
When I constructed the homework, I expected ChatGPT to work well on documentation. Last
summer, some undergraduates taught me that they prefer ChatGPT to Stack Overflow. To test
this observation on my class of masters students, I asked them to do the following with vectors
from MUSE (which they had used previously):
Some people are finding ChatGPT to be useful for documentation (e.g., as a replacement for
Stack Overflow). Suppose I have a set of vectors from MUSE and I want to index them with
annoy and/or faiss.
o https://www.imdb.com/title/tt0587500/
p https://www.youtube.com/watch?v=MPhg_Hhsvhg
Most students used ChatGPT, as is, for this question. ChatGPT was amazingly good on questions
1–4.
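To make the documentation task concrete, here is a minimal sketch of the kind of code the assignment calls for, assuming MUSE vectors in their standard text format (a header line giving the vocabulary size and dimension, then one word and its vector per line). The file name, vocabulary cut-off, and number of trees are illustrative choices, not part of the original questions; an equivalent faiss index over the same matrix is sketched in the trailing comments.

# A minimal sketch (not the students' or ChatGPT's answer): index MUSE word
# vectors with annoy and query approximate nearest neighbors.
import numpy as np
from annoy import AnnoyIndex

def load_muse(path, max_words=50000):
    """Load up to max_words vectors from a MUSE .vec text file."""
    words, rows = [], []
    with open(path, encoding="utf-8") as f:
        count, dim = map(int, f.readline().split())
        for _ in range(min(count, max_words)):
            word, rest = f.readline().rstrip().split(" ", 1)
            words.append(word)
            rows.append(np.array(rest.split(), dtype="float32"))
    return words, np.vstack(rows)

words, X = load_muse("wiki.multi.en.vec")   # illustrative path to MUSE vectors
dim = X.shape[1]

index = AnnoyIndex(dim, "angular")          # angular distance approximates cosine
for i, vec in enumerate(X):
    index.add_item(i, vec)
index.build(10)                             # more trees: slower build, better recall

# Approximate nearest neighbors of "dog"
for j in index.get_nns_by_item(words.index("dog"), 5):
    print(words[j])

# The same matrix can be indexed with faiss instead, e.g.:
# import faiss
# faiss.normalize_L2(X)                     # normalize so inner product = cosine
# flat = faiss.IndexFlatIP(dim)
# flat.add(X)
# distances, neighbors = flat.search(X[:1], 5)

With annoy, the number of trees trades index-build time against recall; faiss offers a similar trade-off through the choice of index type.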
6. Amazingly bad
6.1 Made-up references
But ChatGPT was amazingly bad on question 6. A number of the students returned the same
awful answer:
Title: “A Survey of Nearest Neighbor Search Algorithms”
Authors: Yufei Tao, Dongxiang Zhang
Link: Survey Paper (link to https://arxiv.org/abs/1904.06188)
Note that this paper does not exist, though there are a number of papers with similar titles, such as Abbasifard et al. (2014). Both authors, Dongxiang Zhang and Yufei Tao, can be found on Google Scholar, though I was unable to find a paper that they wrote together. Worse, the link points to
Amanbek et al. (2020), a completely different paper on a different topic with a different title and
different authors. I believe the students used ChatGPT to find this non-existent paper. I had hoped
that the students would do more fact-checking than they did, especially after having discussed
hallucinations in class, but users do not do as much fact-checking as they should. Perhaps the
court of public opinion needs to increase the penalties for trafficking in misinformation.
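As an illustration of how little fact-checking would have been needed, the sketch below (not part of the assignment, and only one of many possible checks) queries the public arXiv API for the ID embedded in the link and compares the returned title with the title ChatGPT claimed.

# A minimal sketch of citation fact-checking: ask arXiv what paper actually sits
# behind an ID and compare it with the claimed title.  The ID and titles are
# taken from the bogus reference discussed above.
import urllib.request
import xml.etree.ElementTree as ET

def arxiv_title(arxiv_id):
    """Return the title that arXiv reports for a given identifier."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as response:
        feed = ET.parse(response).getroot()
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    entry = feed.find("atom:entry", ns)
    return " ".join(entry.find("atom:title", ns).text.split())

claimed = "A Survey of Nearest Neighbor Search Algorithms"
actual = arxiv_title("1904.06188")
print(actual)                                # a paper on Darcy flow, not a survey
print(claimed.lower() == actual.lower())     # False: the citation does not check out

On this example, the API should return the title of Amanbek et al. (2020) rather than the claimed survey, which is exactly the signal a skeptical reader needs before repeating a reference.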
This case is similar to a recent case where a lawyer relied on A.I. He “did not comprehend” that
the chat bot could lead him astray. The bot crafted a motion full of made-up case law.
As Mr. Schwartz answered the judge’s questions, the reaction in the courtroom, crammed with
close to 70 people who included lawyers, law students, law clerks and professors, rippled across
the benches. There were gasps, giggles and sighs. ... "I continued to be duped by ChatGPT. It's
embarrassing,” Mr. Schwartz said.q
The episode, which arose in an otherwise obscure lawsuit, has riveted the tech world, where
there has been a growing debate about the dangers – even an existential threat to humanity –
posed by artificial intelligence. It has also transfixed lawyers and judges.
In addition to computer errors (hallucinations), there were also some human errors. I’m not sure
the students (and ChatGPT) understand the difference between primary literature and secondary
literature. One student confused approximate nearest neighbors with shortest paths. A survey paper
is not the same as a paper whose title contains the string "survey." That said, I am much more concerned with trafficking in computer-generated misinformation (bogus references) than with human errors by well-meaning students who are making excellent progress on the material in class.
6.2 Essays that are not only wrong but lack depth and perspective in ways that could be dangerous
Several questions on the homework asked the students to write essays:
Please use ChatGPT, Google and whatever other tools might be useful to do this assignment.
The point is not so much to solve the problems, but to learn how to use these tools effectively,
and to discover their strengths and weaknesses.
q https://www.nytimes.com/2023/06/08/nyregion/lawyer-chatgpt-sanctions.html
In retrospect, I wish I had been more explicit about asking for fact-checking. One of the student
essays contained the following paragraph:
During the First Opium War (1839–1842), the British government was led by the
Conservative Party under Prime Minister Sir Robert Peel. The opposition, primarily the
Whigs, had varying views on the war. Some opposed it on moral grounds, criticizing the ethics
of trading in opium, while others were concerned about the potential impact on international
relations and trade.
This paragraph includes a number of factual errors. While the dates are correct,r the Conservatives were in opposition at the time. Peel was Prime Minister under a Conservative government, but not at that time.s
In fact, the Opium War had little to do with opium. Neither the government (Whigs) nor the opposition (Conservatives) wanted to have anything to do with the drug trade. The Whigs had just abolished slavery and considered the drug trade to be a form of slavery. The Conservatives also objected to the drug trade, though for different reasons: they viewed it as bad for business (in textiles and tea). The name of the conflict, the Opium War, comes from an editorial of March 23, 1840, in the conservative newspaper The Times, which argued that
The British would be saddled with the massive expense of an unnecessary foreign campaign
that would cost far more than the entire value of the lost opium. Platt (2019), p. 393.
The government was put in an awkward corner because Charles Elliot, their representative in China, mishandled the situation. He convinced the drug smugglers to give him their drugs in
return for British IOUs, and then he handed over the drugs to the Chinese authorities for destruc-
tion. When Parliament discovered that they could not afford to make good on the IOUs, they
thought they could use force to get the Chinese to pay the 2 million pounds for the lost opium.
Here is the question that I gave to the students:
r https://en.wikipedia.org/wiki/First_Opium_War
s https://en.wikipedia.org/wiki/Robert_Peel
t https://www.youtube.com/watch?v=17WF0v48vGw&t=1663s
appreciate the difference between ChatGPT and an academic discussion by a historian who had just published a book on this topic (Platt 2019).
Most of the essays from the students repeated output from ChatGPT more or less as is. These essays tended to include a number of factual errors, but more seriously, they lacked depth and perspective. Platt (2019, p. 444) argued that Napoleon understood that it would be foolish for Britain to use its short-term advantage in technology to humiliate the Chinese. Eventually, the Chinese would do what they have since done (become stronger). Since the 1920s, these events have been referred to as the "century of humiliation" by the authorities in China.u Platt makes it clear that
the current Chinese government is using this terminology to motivate its efforts to compete with
the West in technologies such as artificial intelligence. When we discussed these essays in class,
I tried to argue that over-simplifying the truth, and taking the Western side of the conflict, could
be dangerous and could lead to a trade war, if not a shooting war.
7. Conclusions
But despite such risks, usage of ChatGPT will almost surely continue to grow, since it is so easy to
use, and so (incredibly) credible. I would be more comfortable with this reality if we encouraged
more usage with humans-in-the-loop, with a better characterization of when the machine can be
trusted and when humans should intervene.
We have seen that LLMs (and ChatGPT) have much to offer and can be a useful tool for stu-
dents. However, there are risks. Users of LLMs, including students and everyone else, should be
made aware of strengths (fluency) and weaknesses (trustworthiness). Users should be expected to
do their own fact-checking before trafficking in misinformation. Laziness is inexcusable. Users are
responsible for what they say, whether or not it came from a chat bot.
That said, realistically, laziness is also inevitable. Chat bots are not going away. If others are
conservative with the truth and with fact-checking, then we will become conservative with belief.
Credibility may disappear before chat bots go away.
References
Abbasifard M.R., Ghahremani B. and Naderi H. (2014). A survey on nearest neighbor search methods. International Journal
of Computer Applications 95(25), 39–52.
Amanbek Y., Singh G., Pencheva G. and Wheeler M.F. (2020). Error indicators for incompressible darcy flow problems
using enhanced velocity mixed finite element method. Computer Methods in Applied Mechanics and Engineering 363,
112884.
Camburu O.-M., Shillingford B., Minervini P., Lukasiewicz T. and Blunsom P. (2020). Make up your mind! Adversarial
generation of inconsistent natural language explanations. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. Association for Computational Linguistics, pp. 4157–4165, Online.
Carbonell J.G. (1980). Metaphor - a key to extensible semantic analysis. In 18th Annual Meeting of the Association for
Computational Linguistics, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics, pp. 17–21.
Chen M., Tworek J., Jun H., Yuan Q., de Oliveira Pinto H.P., Kaplan J., Edwards H., Burda Y., Joseph N., Brockman
G., Ray A., Puri R., Krueger G., Petrov M., Khlaaf H., Sastry G., Mishkin P., Chan B., Gray S., Ryder N., Pavlov M.,
Power A., Kaiser L., Bavarian M., Winter C., Tillet P., Such F. P., Cummings D., Plappert M., Chantzis F., Barnes E.,
Herbert-Voss A., Guss W.H., Nichol A., Paino A., Tezak N., Tang J., Babuschkin I., Balaji S., Jain S., Saunders W.,
Hesse C., Carr A.N., Leike J., Achiam J., Misra V., Morikawa E., Radford A., Knight M., Brundage M., Murati M.,
Mayer K., Welinder P., McGrew B., Amodei D., McCandlish S., Sutskever I. and Zaremba W. (2021). Evaluating large
language models trained on code. arXiv preprint arXiv:2107.03374.
Chia Y.K., Hong P., Bing L. and Poria S. (2023). InstructEval: towards holistic evaluation of instruction-tuned large language
models. arXiv preprint arXiv:2306.04757.
Chomsky N. (1957). Syntactic Structures. The Hague: Mouton.
Chomsky N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
u https://en.wikipedia.org/wiki/Century_of_humiliation
Church K., Schoene A., Ortega J.E., Chandrasekar R. and Kordoni V. (2023). Emerging trends: unfair, biased, addictive,
dangerous, deadly, and insanely profitable. Natural Language Engineering 29(2), 483–508.
Church K.W. and Yue R. (2023). Emerging trends: smooth-talking machines. Natural Language Engineering 29(5),
1402–1410.
Cinelli M., Morales G.D.F., Galeazzi A., Quattrociocchi W. and Starnini M. (2021). The echo chamber effect on social
media. Proceedings of the National Academy of Sciences 118. https://doi.org/10.1073/pnas.2023301118.
Dale R. (2021). GPT-3: what's it good for? Natural Language Engineering 27(1), 113–118.
Dua D., Wang Y., Dasigi P., Stanovsky G., Singh S. and Gardner M. (2019). DROP: a reading comprehension benchmark
requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis,
Minnesota. Association for Computational Linguistics, pp. 2368–2378.
Fass D. and Wilks Y. (1983). Preference semantics, ill-formedness, and metaphor. American Journal of Computational
Linguistics 9(3-4), 178–187.
Fortuna P., Soler J. and Wanner L. (2021). How well do hate speech, toxicity, abusive and offensive language classification
models generalize across datasets? Information Processing and Management 58(3), 102524.
Frohberg J. and Binder F. (2022). CRASS: a novel data set and benchmark to test counterfactual reasoning of large lan-
guage models. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France. European
Language Resources Association, pp. 2126–2140.
Gale W.A. and Church K.W. (1991). Identifying word correspondences in parallel texts. In Speech and Natural Language:
Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991.
Gedigian M., Bryant J., Narayanan S. and Ciric B. (2006). Catching metaphors. In Proceedings of the Third Workshop on
Scalable Natural Language Understanding, New York City, New York. Association for Computational Linguistics, pp. 41–48.
Hendrycks D., Burns C., Basart S., Zou A., Mazeika M., Song D. and Steinhardt J. (2020). Measuring massive multitask
language understanding. arXiv preprint arXiv:2009.03300.
Hobbs J.R. (1992). Metaphor and abduction. In Communication From an Artificial Intelligence Perspective: Theoretical and
Applied Issues. Berlin, Heidelberg: Springer, pp. 35–58.
Jia R. and Liang P. (2017). Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017
Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. Association for Computational
Linguistics, pp. 2021–2031.
Krishnakumaran S. and Zhu X. (2007). Hunting elusive metaphors using lexical resources. In Proceedings of the Workshop
on Computational Approaches to Figurative Language, Rochester, New York. Association for Computational Linguistics, pp.
13–20.
Lakoff G. (2008). Women, Fire, and Dangerous Things: What Categories Reveal About the Mind. Chicago: University of
Chicago Press.
Lakoff G. and Johnson M. (2008). Metaphors We Live by. Chicago: University of Chicago Press.
Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Küttler H., Lewis M., Yih W.-t., Rocktäschel T., et al.
(2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing
Systems 33, 9459–9474.
Li D., Zhang Y., Peng H., Chen L., Brockett C., Sun M.-T. and Dolan B. (2021). Contextualized perturbation for tex-
tual adversarial attack. In Proceedings of the 2021 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 5053–5069.
Online.
Martin J.H. (1990). A Computational Model of Metaphor Interpretation. Cambridge, MA: Academic Press Professional, Inc.
Minsky M. and Papert S. (1969). Perceptrons: An Introduction to Computational Geometry, expanded edn. Cambridge, MA: The MIT Press.
Mohammad S., Shutova E. and Turney P. (2016). Metaphor as a medium for emotion: An empirical study. In Proceedings
of the Fifth Joint Conference on Lexical and Computational Semantics, Berlin, Germany. Association for Computational
Linguistics, pp. 23–33.
Morris J., Lifland E., Yoo J.Y., Grigsby J., Jin D. and Qi Y. (2020). TextAttack: a framework for adversarial attacks, data
augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations. Association for Computational Linguistics, pp. 119–126. Online.
Platt S.R. (2019). Imperial Twilight: The Opium War and the End of China’s Last Golden Age. Vintage.
Poletto F., Basile V., Sanguinetti M., Bosco C. and Patti V. (2020). Resources and benchmark corpora for hate speech
detection: a systematic review. Language Resources and Evaluation 55(2), 477–523.
Qazvinian V., Rosengren E., Radev D.R. and Mei Q. (2011). Rumor has it: identifying misinformation in microblogs.
In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK.
Association for Computational Linguistics, pp. 1589–1599.
Shu K., Sliva A., Wang S., Tang J. and Liu H. (2017). Fake news detection on social media: a data mining perspective. ACM
SIGKDD Explorations Newsletter 19(1), 22–36.
Shutova E. (2010). Models of metaphor in NLP. In Proceedings of the 48th Annual Meeting of the Association for Computational
Linguistics, Uppsala, Sweden. Association for Computational Linguistics, pp. 688–697.
Srivastava A., Rastogi A., Rao A., Shoeb A.A.M., Abid A., Fisch A., Brown A. R., Santoro A., Gupta A., Garriga-Alonso A.,
et al. (2023). Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Transactions
on Machine Learning Research.
Vicario M.D., Bessi A., Zollo F., Petroni F., Scala A., Caldarelli G., Stanley H.E. and Quattrociocchi W. (2016). The
spreading of misinformation online. Proceedings of The National Academy of Sciences of The United States of America
113(3), 554–559.
Wang B., Xu C., Wang S., Gan Z., Cheng Y., Gao J., Awadallah A.H. and Li B. (2021). Adversarial GLUE: a multi-task
benchmark for robustness evaluation of language models. ArXiv, abs/2111.02840.
Wang J., Hu X., Hou W., Chen H., Zheng R., Wang Y., Yang L., Huang H., Ye W., Geng X., Jiao B., Zhang Y. and Xie X.
(2023). On the robustness of ChatGPT: an adversarial and out-of-distribution perspective. ArXiv, abs/2302.12095.
Wei J., Wang X., Schuurmans D., Bosma M., Xia F., Chi E., Le Q.V., Zhou D., et al. (2022). Chain-of-thought prompting
elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837.
Ziegler D.M., Nix S., Chan L., Bauman T., Schmidt-Nielsen P., Lin T., Scherlis A., Nabeshima N., Weinstein-Raun B.,
Haas D., Shlegeris B. and Thomas N. (2022). Adversarial training for high-stakes reliability. ArXiv, abs/2205.01663.
Zubiaga A., Aker A., Bontcheva K., Liakata M. and Procter R. (2017). Detection and resolution of rumours in social media.
ACM Computing Surveys (CSUR) 51, 1–36.
Cite this article: Church K. Emerging trends: When can users trust GPT, and when should they intervene? Natural Language
Engineering https://doi.org/10.1017/S1351324923000578