
Investigating Code Generation Performance of ChatGPT with Crowdsourcing Social Data

Abstract—The recent advancements in Artificial Intelligence, particularly in large language models and generative models, are reshaping the field of software engineering by enabling innovative ways of performing various tasks, such as programming, debugging, and testing. However, few existing works have thoroughly explored the potential of AI in code generation and software developers' attitudes toward AI-assisted coding tools. This knowledge gap leaves it unclear how AI is transforming software engineering and programming education. This paper presents a scalable crowdsourcing data-driven framework to investigate the code generation performance of generative large language models from diverse perspectives across multiple social media platforms. Specifically, we utilize ChatGPT, a popular generative large language model, as a representative example to reveal its insights and patterns in code generation. First, we propose a hybrid keyword expansion method that integrates words suggested by topic modeling and expert knowledge to filter relevant social posts of interest on Twitter and Reddit. Then we collect 316K tweets and 3.2K Reddit posts about ChatGPT's code generation, spanning from December 1, 2022 to January 31, 2023. Our data analytics show that ChatGPT has been used in more than 10 programming languages, with Python and JavaScript being the two most popular, for a diverse range of tasks such as code debugging, interview preparation, and academic assignment solving. Surprisingly, our analysis shows that fear is the dominant emotion associated with ChatGPT's code generation, overshadowing happiness, anger, surprise, and sadness. Furthermore, we construct a dataset of ChatGPT prompts and corresponding code by analyzing the screenshots of ChatGPT code generation shared on social media. This dataset enables us to evaluate the quality of the generated code, and we will make the dataset available to the public soon. We believe the insights gained from our work will provide valuable guidance for future research on AI-powered code generation.

Index Terms—ChatGPT, Code Generation, Software Engineering, Large Language Models (LLMs), Generative Models, Social Media

I. INTRODUCTION

Recently, the advancements in large language models (LLMs) and generative models have revolutionized many applications, including free text generation, question answering, and document summarization, enabling a wide range of real-world services such as AI robot lawyers [1] and AI music co-creation [2]. The field of coding, which involves writing tasks in certain programming languages, is also benefiting from the rapid development of generative LLMs. However, unlike traditional writing tasks, programming requires strict adherence to syntax and logic rules, making it more challenging for generative models to produce high-quality code.

Several studies have investigated the potential of LLMs in software development. For instance, Barke et al. [3] and Vaithilingam et al. [4] examined user perceptions of generative models in code writing. However, many of these studies are based on case studies, with limited consideration of broader applications in software development. The emerging OpenAI ChatGPT, a member of the GPT-3 LLM family, demonstrates promising performance in code generation, attracting widespread attention from stakeholders in software engineering. As shown in Figure 1, ChatGPT can generate the bubble sort algorithm in Python with the prompt "write the bubble sort in Python." Some studies have explored the use of ChatGPT for code generation tasks [5]–[7]. Nonetheless, these studies did not comprehensively evaluate the overall effectiveness of ChatGPT as a code generation and assistance tool on a large scale.

[Fig. 1. ChatGPT writes the bubble sort algorithm in Python]
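Since Figure 1 is a screenshot, the listing below gives an illustrative reconstruction of the kind of bubble sort implementation ChatGPT returns for this prompt; it is not the exact output captured in the figure.

def bubble_sort(values):
    """Sort a list in place using bubble sort and return it."""
    n = len(values)
    for i in range(n - 1):
        swapped = False
        # After each pass, the largest remaining element settles at the end.
        for j in range(n - 1 - i):
            if values[j] > values[j + 1]:
                values[j], values[j + 1] = values[j + 1], values[j]
                swapped = True
        if not swapped:  # Already sorted; stop early.
            break
    return values

print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]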
It is challenging to conduct a large-scale study on the performance of LLMs in code generation for the following reasons. First, programming languages exhibit diverse syntax and are applicable to a wide range of tasks. For instance, SQL is primarily utilized in database operations, while JavaScript is commonly used in web programming. Second, code generation encompasses numerous programming tasks, including debugging, testing, and programming, for various stakeholders. Moreover, conducting user studies in the lab to investigate the code generation of LLMs can be costly and time-consuming. Therefore, conducting a comprehensive study on the performance of LLMs that covers numerous programming languages, tasks, and stakeholders poses significant challenges.

To address the aforementioned challenges, this paper proposes a scalable crowdsourcing data-driven framework that integrates multiple social media data sources to examine the code generation performance of ChatGPT. The proposed framework comprises three key components, namely keyword expansion, data collection, and data analytics. Specifically, we utilize topic modeling and expert knowledge to identify all keywords that are relevant to programming in the context of ChatGPT, thus expanding the seed keyword of "ChatGPT." Using these expanded keywords, we retrieve 316K tweets and 3.2K Reddit posts related to ChatGPT's code generation from December 1, 2022, to January 31, 2023.
Furthermore, we conduct a comprehensive analysis using multimodal data (text and images) to answer the following research questions:

1) What are the most popular programming languages in ChatGPT usage?
2) What programming scenarios, tasks, and purposes are people using ChatGPT for?
3) What is the temporal distribution of the discussion on ChatGPT code generation?
4) How do stakeholders perceive ChatGPT code generation?
5) What are the prompts used to generate code?
6) What is the quality of the code generated by ChatGPT?

To the best of our knowledge, the proposed work represents the first systematic study on emerging generative models in code writing and testing. In this paper, we summarize our contributions as follows:
• We have proposed a scalable crowdsourcing and social data-driven framework for investigating the code generation capabilities of ChatGPT.
• We have presented a novel hybrid keyword expansion method that incorporates words recommended by topic modeling and experts to ensure that most of the related social media posts are matched during data collection.
• Our study considers multiple social media platforms (Twitter and Reddit) and multimodal data (text and image) to mitigate potential biases caused by a single data source or data type.
• We have provided data analytics from multiple perspectives, including topic inference, sentiment analysis, and data quality measurement.
• We have built a real-world programming dataset containing the ChatGPT prompts and the associated generated code. This dataset will be released to the entire software engineering community soon.

II. RELATED WORK

As programming generation and assistant tools, such as CodeBERT [8] and IntelliCode Compose [9], become more widely used, there has been an increased focus on investigating the usability of, and interactions between, users and code generation tools [3], [4], [10], [11]. For example, Barke et al. identified two interaction modes between programmers and code generation tools, acceleration mode and exploration mode, by observing how 20 programmers solved various tasks using the code generation tool Copilot [3]. Although this paper discussed a grounded theory of how programmers might interact differently with Copilot, it only investigated a limited number of users and did not reveal the reactions from users. Vaithilingam et al. [4] performed a study on 24 participants consisting of different groups of people with minimal and moderate experience in using Copilot and IntelliSense. Through quantitative and qualitative analysis, they observed that participants who used Copilot failed to complete tasks more often. Unlike the above case study works, we investigate a large dataset of feedback from users of the code generation tool ChatGPT. Besides the user reactions, our study also examines the performance and limitations of ChatGPT.

As ChatGPT has gained more attention recently, some researchers have studied its use for code generation [5]–[7]. Aljanabi et al. [5] discussed the possibility of using ChatGPT as a code generation tool. Avila et al. [6] described how ChatGPT's programming capability can be used for developing online behavioral tasks, such as concurrent reinforcement schedules, in HTML, CSS, and JavaScript code. In their work, they created files with extensions .html, .css, and .js, which include the basic structure of the page, such as headings, linking with styling elements, and other dynamic elements. In contrast to the above works, we analyze the performance of code generation using ChatGPT at scale.

When it comes to complex coding, there is always a chance of unidentified bugs, which may lead to a code crash. Automated Program Repair (APR) is a concept introduced to provide automatic fixes for detected errors. In recent times, deep learning has enabled APR, and many tools built on Large Language Models (LLMs) with the Transformer architecture have been developed. LLMs are giving better results for many code-related tasks, and researchers have started to use them for APR [12]. Tools such as Codex, CodeBERT, and more recently ChatGPT use LLMs for code fixing.

What makes ChatGPT stand out is its ability to discuss the source code through user interaction. Sobania et al. [13] conducted an experiment to test the efficiency of ChatGPT in bug fixing compared to other tools like Codex and CoCoNut. About 40 QuixBugs benchmark problems containing erroneous code were given to ChatGPT to provide solutions. The experiment showed that ChatGPT's performance is similar to that of other APR tools like Codex. However, when given more information about the problem through its dialogue box, ChatGPT's performance improved, with a success rate of 77.5%.

Although ChatGPT works fine with simple code logic, it can be challenging to describe complex needs, such as designing a web browser where user needs should be satisfied, with the simple, machine-readable instructions that ChatGPT or other AI tools use to produce code [14]. Chatterjee and Dethlefs [15] have claimed that code generated by ChatGPT is also gender and race biased, questioning the efficiency of the model.

Previous works and tools for automated code generation mostly relied on neural networks [16], [17], which cannot match the current performance of GPT-3 models. Raychev et al. [18] proposed a code completion technique using statistical language models to discover highly rated sentences and recommend code completion suggestions. Sun et al. [19] introduced a novel tree-based neural architecture that incorporates grammar rules and AST structures into the network, and it has been shown to have the best accuracy among all neural network-based code generation methods. Ciniselli et al. [20] conducted a detailed empirical study on BERT models for code completion and evaluated the percentage of perfect predictions that match developer-written code snippets. However, while BERT models offer a potential solution for code completion, their performance is lower compared to LLMs such as Copilot and ChatGPT.
In summary, there is a scarcity of research that delves into the applications of AI-assisted code generation tools. ChatGPT has risen as a prevalent option among these tools. The current study aims to assess the effectiveness of ChatGPT as a code generation tool. To our knowledge, this is the first research that employs a dataset from social media to evaluate the performance of a code generation tool.

III. METHOD

In this section, we present the proposed data-driven framework for investigating generated code by introducing how we collect the data of interest, how we analyze the data, and how we interpret the findings.

A. Framework Overview

The overview of the proposed framework is illustrated in Figure 2. It consists of three key components: Keyword Selection, Data Collection, and Data Analytics. Unlike traditional user study-based research, the crowdsourcing platform is designed to be flexible and scalable, enabling the study of a large population over a long period of time. We delve into each component in detail below, examining the performance of LLMs in code generation.

B. Keyword Selection for Software Development

To ensure the quality of the collected data, we employ a hybrid approach that combines data-driven keyword expansion and expert-based keyword selection. This approach ensures that the data is comprehensive and precise, eliminating the risk of bias or incompleteness in the selection of query keywords.

As ChatGPT is one of the most popular LLMs that supports code generation, we use ChatGPT as the seed keyword to sample Twitter streams, harvesting tweets that mention this term. We then perform topic modeling to determine whether a coding-related topic is present. If a coding-related topic is observed, we add the words belonging to this topic to the expanded keyword set. If a coding-related topic is not observed, we conduct a co-occurrence word analysis and calculate the semantic similarity with the word "coding" to expand the candidate keywords.

However, the data-driven keyword expansion method may result in false positives, i.e., keyword candidates that are not relevant to AI-based code generation. Therefore, we manually examine all recommended keyword candidates to ensure the quality of the collected data. We first filter out irrelevant keywords and then propose multiple combinations of keywords to control the precision of data collection. For example, instead of collecting all postings containing ChatGPT, collecting postings containing both ChatGPT and a coding keyword makes the retrieved data more accurate and representative.

Specifically, we leveraged the Twitter Streaming APIs to sample tweet streams containing the keyword "ChatGPT" for over 55 hours. In total, we collected 158,452 tweets, including original tweets, retweets, and replies. After removing duplicate tweets, we had 63,716 unique tweets. We then applied the LDA model to infer topics from these unique tweets, with the hope of discovering programming-related topics. We evaluated numbers of topics ranging from 1 to 30 and found that the coherence score achieved a relatively high and stable value with the number of topics set to 22 (for more details, please see Figure 10). After examining the 22 topics, we identified one of them as "Programming," consisting of the following words: ask, stack, knew, write, error, diffus, run, python, stabl, scientist, email, straight, shock, gener, comput, command, use, code, notic, brain, bug, statement, think, dead, question, admit, happen, result, and overflow.

Combining the words in the Programming topic, we come up with the following keyword list – algorithm, algorithms, bug, bugs, c#, c++, code, coding, command, commands, compiler, computing, debug, debugging, error, go, interpreter, java, javascript, libraries, php, program, programming, python, r, ruby, shell, software, sql, stack overflow, swift, test, testing, typescript – to crawl ChatGPT-related code generation posts.
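The co-occurrence and similarity screening described above can be sketched as follows using pretrained Twitter word vectors from gensim; the candidate extraction heuristic, the choice of the glove-twitter-25 model, and the 0.5 similarity threshold are illustrative assumptions, not the study's exact configuration.

from collections import Counter
import gensim.downloader as api

# Pretrained 25-dimensional GloVe vectors trained on tweets (downloaded on first use).
vectors = api.load("glove-twitter-25")

def co_occurring_terms(tweets, top_n=50):
    """Count tokens that co-occur with the seed keyword in the sampled tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = [t.lower().strip("#@.,!?") for t in tweet.split()]
        if "chatgpt" in tokens:
            counts.update(t for t in tokens if t and t != "chatgpt")
    return [term for term, _ in counts.most_common(top_n)]

def rank_by_similarity(candidates, anchor="coding", threshold=0.5):
    """Keep candidates whose embedding is close to the anchor word."""
    scored = []
    for term in candidates:
        if term in vectors and anchor in vectors:
            score = float(vectors.similarity(anchor, term))
            if score >= threshold:
                scored.append((term, round(score, 3)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

sample_tweets = [
    "ChatGPT fixed my python bug in one try",
    "ChatGPT wrote the SQL query for my homework",
]
print(rank_by_similarity(co_occurring_terms(sample_tweets)))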
C. Data Collection

Based on the above carefully curated keywords, we leverage two social media platforms, i.e., Twitter and Reddit, to collect data for further analytics.

1) Twitter Data: Instead of relying on the Twitter Streaming APIs, we opt to use Twitter's historical data search APIs to create our Twitter dataset for the following reasons: 1) streaming data is time-sensitive, making it impossible to retrieve older data from the debut of ChatGPT if the streaming collection was not launched at that time; 2) if we only investigated the latest data (e.g., after February 1, 2023), it would introduce bias, as we do not know when the performance of ChatGPT's code generation was most widely discussed on social media. In contrast, historical tweets span the entire evolution of ChatGPT and provide a sample of user comments since its release, enhancing the representativeness and completeness of the crowdsourced opinions and feedback.

Twitter provides two historical data search APIs, i.e., the 30-Day Search API (https://tinyurl.com/2s4xt8r7), which allows retrieval of the last 30 days, and the Full-archive Search API (https://tinyurl.com/ehbsjx6v), which provides tweets since 2006, when the first tweet was posted. Since ChatGPT was released on November 30, 2022, we choose the Full-archive Search API to harvest the data. Specifically, we use Twitter's Academic Research API, which supports full-archive tweet search, to retrieve ChatGPT-related data from November 30, 2022 to February 1, 2023, with the query configured to only cover English tweets and exclude retweets by setting "-is:retweet lang:en." In addition to the tweet text, we collect related Twitter media information (e.g., posted images) to support fine-grained analysis. For this study, we collected 316K tweets posted between December 1, 2022, and January 31, 2023.
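As a concrete illustration of this collection step, the sketch below queries the Twitter API v2 full-archive search endpoint with one of the expanded keyword combinations; the bearer token, the exact query string, and the requested fields are placeholders for illustration rather than the project's actual configuration.

import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"  # full-archive search (Academic Research access)
BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder credential

params = {
    # One expanded keyword combination: ChatGPT plus a coding term, English only, no retweets.
    "query": "(ChatGPT (code OR coding OR python OR debug)) -is:retweet lang:en",
    "start_time": "2022-11-30T00:00:00Z",
    "end_time": "2023-02-01T00:00:00Z",
    "max_results": 500,
    "tweet.fields": "created_at,lang,attachments",
    "expansions": "attachments.media_keys",
    "media.fields": "url,type",
}

def fetch_all(params):
    """Page through the full-archive search results and yield raw tweet objects."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    next_token = None
    while True:
        if next_token:
            params["next_token"] = next_token
        resp = requests.get(SEARCH_URL, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        yield from payload.get("data", [])
        next_token = payload.get("meta", {}).get("next_token")
        if not next_token:
            break

tweets = list(fetch_all(params))
print(len(tweets), "tweets collected")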
[Fig. 2. Overview of the proposed crowdsourcing framework to investigate the programming capabilities of LLMs. The pipeline expands a seed keyword through topic modeling and expert suggestions, collects text and image data from Twitter (Full-archive Search API) and Reddit (Pushshift Reddit API), and applies NLP/NLU and image understanding to answer the six research questions.]

2) Reddit Data: Unlike Twitter, where the structure is based on users following one another, Reddit is structured around communities where posts on similar topics are grouped together. These communities are referred to as "subreddits." For instance, the subreddit /r/aww is a community where users share cute and cuddly pictures. The initial posts on Reddit are known as "submissions," and the responses to these posts are called "comments."

To assess the performance of ChatGPT in code generation on Reddit, we concentrate on four well-known subreddits, namely /r/ChatGPT, /r/coding, /r/github, and /r/programming. To gather submissions from these subreddits, we use the Search Reddit Submissions endpoint (/reddit/search/submission) through the Pushshift Reddit API [21]. Similar to the Twitter data, multimedia data including images in Reddit submissions are also retrieved. For this study, we collected 3.2K Reddit submissions posted between December 1, 2022, and January 31, 2023, to analyze the code generation performance of ChatGPT.
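To make this step concrete, the sketch below pulls submissions from the Pushshift search endpoint for the four subreddits; the parameters and the pushshift.io hostname reflect the public API as it existed around the study period and are assumptions here, not the authors' exact script.

import time
import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"
SUBREDDITS = ["ChatGPT", "coding", "github", "programming"]

def fetch_submissions(subreddit, after, before, size=250):
    """Collect submissions mentioning ChatGPT from one subreddit between two epochs."""
    results = []
    while True:
        params = {
            "subreddit": subreddit,
            "q": "ChatGPT",           # keyword filter
            "after": after,            # epoch seconds, advances as we page forward
            "before": before,
            "size": size,
            "sort": "asc",
            "sort_type": "created_utc",
        }
        resp = requests.get(PUSHSHIFT_URL, params=params, timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        if not batch:
            return results
        results.extend(batch)
        after = batch[-1]["created_utc"]  # continue after the last seen submission
        time.sleep(1)                      # be polite to the public endpoint

# Dec. 1, 2022 to Jan. 31, 2023 (UTC epoch seconds)
submissions = []
for name in SUBREDDITS:
    submissions += fetch_submissions(name, after=1669852800, before=1675209599)
print(len(submissions), "submissions collected")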
D. Data Analytics and Pattern Recognition

We primarily employ natural language processing and image understanding techniques to analyze the text and image data to uncover insights and identify patterns.

1) Text Based Topic Discovery: To obtain a comprehensive understanding of the use of ChatGPT in code generation on Twitter and Reddit, we employ latent Dirichlet allocation (LDA) [22], a widely used topic modeling technique, to uncover hidden topics in the collected tweets and Reddit submissions. We treat each tweet or submission as an individual document and the entire collection of tweets or submissions as the corpus. In the text preprocessing stage, we implement commonly used techniques such as removing stop words and frequently occurring words like "ChatGPT," tokenization, and lemmatization. We then apply term frequency-inverse document frequency (TF-IDF) weighting to the documents to form a TF-IDF-based corpus, on which latent topics are extracted using LDA models. Following previous research on big social data analysis [23], [24], we select the Cv metric to determine the appropriate number of topics. This metric is known to be one of the best coherence measures as it combines normalized pointwise mutual information (NPMI) and cosine similarity [25].
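A minimal sketch of this pipeline using gensim is shown below; the preprocessing is simplified, the docs variable holds a toy set of already cleaned token lists, and the scanned range of topic counts is truncated relative to the 1-30 range described above.

from gensim import corpora, models
from gensim.models import CoherenceModel

# docs: one token list per tweet/submission, already cleaned and lemmatized (toy example).
docs = [
    ["write", "python", "function", "sort", "list"],
    ["chatgpt", "write", "python", "code", "bug"],
    ["fix", "bug", "javascript", "code", "error"],
    ["chatgpt", "debug", "error", "python", "script"],
    ["write", "sql", "query", "database", "code"],
    ["chatgpt", "interview", "question", "code", "practice"],
]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Re-weight the bag-of-words corpus with TF-IDF before topic extraction.
tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]

best_k, best_score, best_model = None, float("-inf"), None
for k in range(2, 6):  # the paper scans 1-30 topics with repeated runs
    lda = models.LdaModel(tfidf_corpus, id2word=dictionary, num_topics=k,
                          passes=10, random_state=42)
    # Cv coherence combines NPMI and cosine similarity over the top topic words.
    cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
    score = cm.get_coherence()
    if best_model is None or score > best_score:
        best_k, best_score, best_model = k, score, lda

print("best number of topics:", best_k)
for topic_id, words in best_model.show_topics(num_topics=best_k, num_words=10, formatted=False):
    print(topic_id, [w for w, _ in words])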
Given that Twitter allows users to utilize #hashtags to indicate related topics and enhance visibility through searches, we also present the distribution of #hashtags in the collected tweets. However, as #hashtags are rarely used on Reddit, we do not perform this analysis for Reddit submissions.

2) Image Understanding: As ChatGPT is a text generative model, it is expected that most images related to ChatGPT, particularly those related to code generation, posted on social media will be text-rich. To make these images more informative and easier to process for downstream tasks, we suggest using an Optical Character Recognition (OCR) based approach to convert the collected images into text. We apply multiple OCR methods, including the OpenCV-based pytesseract (https://pypi.org/project/pytesseract/) and the deep learning-based easyOCR (https://github.com/JaidedAI/EasyOCR), to the collected image dataset. After thoroughly evaluating the OCR detection results, we choose easyOCR to identify and extract text from the images accurately.

3) Code Reconstruction from Images: To reconstruct the code generated by ChatGPT, it is crucial to identify the images that contain generated code. After examining the screenshots of coding snippets, we found that all ChatGPT-generated code snippets contain the keyword "Copy code" in the top-right corner of the coding block, as shown in Figure 1. Therefore, we selected all images containing the "Copy code" keyword for further analysis.

We propose two methods to recover the code generated by ChatGPT. The first is to extract the code directly from the OCR results. We found that it is crucial to address any indentation issues for indentation-sensitive programming languages, such as Python, as a high percentage of errors can occur due to improper indentation. However, automatically indenting arbitrary code is a complex and challenging task. A simple script that looks for loops and specific statements to increase and decrease the indentation level does not work on all code, especially if the code mixes multiple indentation styles and conditional statements.

An alternative method to obtain the code is to reproduce it using the identical prompt. Specifically, we identify the prompt and input it into the ChatGPT web service (https://openai.com/blog/chatgpt/) to regenerate the code. Once we have downloaded the newly produced code, we can assess and evaluate it. In our study, we adopted this more reliable method to reconstruct the code generated by ChatGPT.
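The screening and extraction steps above can be sketched as follows with easyOCR; the directory layout and the heuristic of keeping any image whose OCR output contains "Copy code" are illustrative assumptions.

from pathlib import Path
import easyocr

reader = easyocr.Reader(["en"])  # downloads the English recognition model on first use

def ocr_text(image_path):
    """Return the recognized text of an image as a single string."""
    # detail=0 returns plain strings; paragraph=True merges nearby lines.
    lines = reader.readtext(str(image_path), detail=0, paragraph=True)
    return "\n".join(lines)

code_images, other_images = [], []
for image_path in Path("collected_images").glob("*.png"):  # assumed folder of downloaded media
    text = ocr_text(image_path)
    # ChatGPT screenshots of code blocks show a "Copy code" button in the corner.
    if "Copy code" in text:
        code_images.append((image_path, text))
    else:
        other_images.append(image_path)

print(f"{len(code_images)} screenshots appear to contain generated code")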
4) Sentiment Analysis: Considering that ChatGPT may trigger diverse emotions around code generation, we do not think the three categories of positive, negative, and neutral can cover all of the involved emotions. In order to accurately reflect the various and complex emotions expressed in social media users' comments, we categorize them into five more inclusive emotions: Happy, Angry, Sad, Surprise, and Fear. To achieve this, we utilize Text2Emotion [26], a Python package capable of analyzing sentiments and categorizing them into the aforementioned five emotions.
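A minimal usage sketch is shown below; the example posts are invented, and the per-language aggregation mirrors the analysis reported in Section IV-D rather than reproducing the authors' exact code.

import text2emotion as te

posts = [
    # Hypothetical posts already filtered to a given programming language.
    "ChatGPT wrote my Python scraper in seconds, I'm worried about my job...",
    "Asked ChatGPT to debug my script and it actually explained the fix!",
]

totals = {"Happy": 0.0, "Angry": 0.0, "Surprise": 0.0, "Sad": 0.0, "Fear": 0.0}
for post in posts:
    scores = te.get_emotion(post)  # returns a dict over the five emotions
    for emotion, value in scores.items():
        totals[emotion] += value

# Average emotion intensity across the posts for this language.
averages = {emotion: value / len(posts) for emotion, value in totals.items()}
print(averages)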
5) Code Quality Evaluation: To assess the quality of the code generated by ChatGPT, we utilize Flake8 [27], a wrapper around PyFlakes, pycodestyle, and Ned Batchelder's McCabe script. Flake8 runs all of these tools through a single invocation and assigns a unique code to each reported error or warning; the output is displayed per file. We choose Flake8 as our evaluation tool because it is one of the most powerful and flexible tools available, providing a wide range of error codes while remaining fast to run. Flake8 is particularly effective when checking for correctness and whitespace issues, making it an ideal choice for our purposes.
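The evaluation can be scripted as below: each reconstructed snippet is written to its own .py file, Flake8 is run over the folder, and the reported codes are tallied; the folder name and the use of Flake8's default configuration are assumptions.

import subprocess
from collections import Counter

# Run Flake8 over the folder of reconstructed .py files with a machine-readable format.
result = subprocess.run(
    ["flake8", "--format=%(code)s", "reconstructed_snippets/"],
    capture_output=True, text=True,
)

codes = [line.strip() for line in result.stdout.splitlines() if line.strip()]
counts = Counter(codes)
total = sum(counts.values())

# Percentage of each Flake8 code, as reported in Table I.
for code, count in counts.most_common():
    print(f"{code}\t{count / total:.2%}")

# Share of pycodestyle (E/W) versus pyflakes (F) findings.
by_family = Counter(code[0] for code in codes)
print({family: f"{count / total:.2%}" for family, count in by_family.items()})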
IV. EVALUATION AND FINDINGS

In this section, we present the evaluation results and highlight our findings on the performance of code generation by ChatGPT. We summarize the topics discussed in social media posts and the strengths and weaknesses of ChatGPT's code generation capabilities.

A. Programming Language Distribution

ChatGPT supports code generation for multiple programming languages. We illustrate the popularity of the top 12 programming languages across Twitter and Reddit in Figure 3. We can see that Python is the most popular language in both communities, far ahead of the other languages. Python has become the top programming language in many fields, such as artificial intelligence, machine learning, data analytics, automation, and scientific computing. JavaScript, R, and Shell/Bash, among the most popular programming languages nowadays, are also well supported by ChatGPT.

[Fig. 3. Programming language distribution on Twitter and Reddit (c#, c++, java, javascript, php, python, r, ruby, shell, sql, swift, typescript)]

B. Topics Related to Code Generation

We generated topics for the tweets containing the keyword "ChatGPT" and programming-related words using the LDA model. Based on the coherence scores presented in Figure 4, we finally select 17 topics. The 17 topics and the word list of each topic are presented in Table II (see Appendix B). The topic modeling results indicate that ChatGPT has been used for different purposes regarding code generation, such as debugging code (topics 9, 13, and 17), testing code and algorithms (topics 5 and 16), preparing for programming interviews (topics 2 and 4), working on programming-related assignments (topics 3 and 6), and other related tasks. Twitter users also notice that ChatGPT's capacity for code generation is limited (topic 1). Still, the ethical issues and social responsibility aspects of ChatGPT have not been discussed much among users.

[Fig. 4. Coherence scores of LDA with different numbers of topics]

To further investigate the implications and impacts of ChatGPT on different AI technologies, applications, and industries, we extracted the hashtag-based topics, which are shown in Figure 5. The hashtags include #AI, #OPENAI, #ArtificialIntelligence, #Programming, #Python, #Coding, and others. We group the hashtags into five clusters: ChatGPT, AI & ML & DS, Company, Programming, and Other Tech. Based on the topic frequencies in Figure 5, we can see that ChatGPT has a great impact on AI and its related fields. Both academia and the IT industry need to pay attention to this new technology.

[Fig. 5. Hashtag-based topics (frequencies of #CHATGPT, #ArtificialIntelligence, #Technology, #AI, #OPENAI, #Programming, #Python, #Coding, #ChatBOT, #GPT3, #ChatGPT3, #TECH, #CODE, #MachineLearning, #GOOGLE, #ML, #WEB3, #DATASCIENCE, #JavaScript, #Crypto, #SEO, #Innovation, #MICROSOFT, #OpenAIChatGPT, #CyberSecurity, #100DaysOfCode, #Developer, #Software, #GenerativeAI, #BITCOIN, grouped into five clusters). We exclude the 35.4% ratio of #ChatGPT during visualization to prevent it from overpowering other topics.]

C. Temporal Distribution
Temporal analysis can be used to examine popularity over time. Figure 6 visualizes the daily distribution of posts on Twitter (blue) and Reddit (yellow) related to ChatGPT's code generation in the first two months after its launch. The ChatGPT discussion spread faster on Twitter than on Reddit. We observe a peak of ChatGPT code generation discussion on both Twitter and Reddit at the end of the first week after the release of ChatGPT. The popularity decreased from the second week onward but remained high on both platforms. Even after two months, the attention on ChatGPT is still stable, indicating that ChatGPT is helpful for code generation.

[Fig. 6. Daily distribution of posts related to ChatGPT's code generation in the first two months after its launch (Dec. 1, 2022 to Jan. 31, 2023)]

D. Sentiment on Code Generation

Figure 7 presents sentiment analysis results for code generation in eight programming languages across two social media platforms: Twitter and Reddit. The emotions were categorized into five distinct groups: happiness, anger, sadness, surprise, and fear. Overall, fear is the most commonly expressed emotion on both platforms for all programming languages, likely due to concerns about ChatGPT's code quality and its potential impact on human jobs. We observed that fear is more frequently expressed for SQL, Java, and C#.

The second most commonly expressed emotion on Twitter is sadness for Python, JavaScript, R, C++, and Shell. This may also be linked to concerns about ChatGPT's potential impact on human jobs. Surprise is the third most commonly expressed emotion for Python, JavaScript, R, Shell, and C++. The surprise may result from the quality of the generated code. Happiness and anger are the two least frequently observed emotions.

We also compared the sentiment analysis results on both platforms and found similar patterns for all programming languages except SQL and C++. For SQL, Reddit users expressed more sadness than Twitter users, possibly due to their greater knowledge of SQL and concerns about the quality of ChatGPT's code generation. Regarding C++, we observed that Reddit users showed more happiness than Twitter users, which may indicate less worry about ChatGPT's potential threat to their jobs.

E. A Public Dataset of Prompts and Generated Code

From the OCR results of the Twitter images, we identified and extracted 332 prompts. Figure 8 provides a wordcloud overview of all extracted prompts, where Python-related questions are the most common. In particular, Twitter users prefer words such as "write", "code", "function", and "program" when constructing their coding prompts.

We constructed a dataset of .py files for all Python-related prompts, with each .py file containing the prompt and the corresponding code generated by ChatGPT. Figure 9 shows a sample .py file from the dataset, where the prompt is commented at the beginning of the file. The complete dataset will be publicly released soon to the software engineering community.
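Since Figure 9 is an image, the listing below gives a hypothetical file in the same layout: the prompt is kept as a leading comment, followed by the code ChatGPT returned for it. Both the prompt and the code here are invented for illustration, not an actual dataset entry.

# Prompt: write a Python function that checks whether a string is a palindrome

def is_palindrome(text):
    """Return True if the string reads the same forwards and backwards."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return cleaned == cleaned[::-1]


if __name__ == "__main__":
    print(is_palindrome("Racecar"))  # True
    print(is_palindrome("ChatGPT"))  # False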
F. Code Quality Evaluation

We submitted the Python code snippets generated by ChatGPT to Flake8 as individual .py files to check for quality and errors. Flake8 reported the error codes for each file, along with the position and description of each error. After evaluating the code snippets, we found that the majority of the findings are pycodestyle errors with code E (80.74%), followed by pycodestyle warnings with code W (17.25%). The fewest findings are attributed to pyflakes, with code F (2.03%). Among the unique error codes, there are 13 E codes, with the majority of errors linked to E501 (line too long). Additionally, there are five unique W codes and three unique F codes. Table I provides a detailed summary of the evaluation results, including the percentage of each Flake8 code for the overall evaluation.

TABLE I
CODE QUALITY RESULTS BY FLAKE8

Code | Description | Percentage
E501 | line too long (114 > 79 characters) | 29.39%
E231 | missing whitespace after ',' | 15.54%
E302 | expected 2 blank lines, found 1 | 13.51%
W293 | blank line contains whitespace | 10.14%
E402 | module level import not at top of file | 5.41%
E305 | expected 2 blank lines after class, found 1 | 4.73%
E265 | block comment should start with '# ' | 4.39%
W291 | trailing whitespace | 2.70%
E999 | SyntaxError: invalid syntax | 2.36%
E227 | missing whitespace around bitwise or shift operator | 1.69%
W292 | no newline at end of file | 1.69%
E101 | indentation contains mixed spaces and tabs | 1.35%
F401 | 'torch' imported but unused | 1.35%
W191 | indentation contains tabs | 1.35%
W391 | blank line at end of file | 1.35%
E261 | at least two spaces before inline comment | 1.01%
E225 | missing whitespace around operator | 0.68%
F811 | redefinition of unused 'pymesh' from line 5 | 0.34%
E902 | TokenError: EOF in multi-line statement | 0.34%
F821 | undefined name 'output value' | 0.34%
E741 | ambiguous variable name 'I' | 0.34%

V. CONCLUSION

This paper presents a framework for exploring the code generation capabilities of ChatGPT through the analysis of crowdsourced data on Twitter and Reddit. The results show that Python and JavaScript are the most frequently discussed programming languages on social media and that ChatGPT is used in a variety of code generation domains, e.g., debugging code, preparing for programming interviews, and solving academic assignments. Sentiment analysis reveals that people generally have fears about the code generation capabilities of ChatGPT, rather than feeling happy, angry, surprised, or sad. The study also includes the construction of a code generation prompt dataset, which will be made publicly available, and an evaluation of the quality of the code generated by ChatGPT using Flake8. We hope this work provides valuable insights into the adoption of ChatGPT in software development and programming education.
[Fig. 7. Sentiment analysis results on code generation for eight programming languages, shown as radar charts over Happy, Angry, Surprise, Sad, and Fear for Twitter and Reddit: (a) Python, (b) JavaScript, (c) R, (d) Shell, (e) SQL, (f) C++, (g) Java, (h) C#]

[Fig. 8. WordCloud of prompts]

[Fig. 9. A sample in the public dataset of prompts (Line 3) and generated code (Line 4 - Line 17)]

APPENDIX

A. Coherence Scores of LDA with Different Numbers of Topics

One of the most important steps in applying a topic model such as LDA is to select an appropriate number of topics for the corpus [28]. The reason is that choosing too few topics will produce over-broad topics, while choosing too many topics will lead to substantial overlap between topics. In this study, we choose the Cv metric, a widely used coherence measurement, to decide the optimal number of topics in our corpus. Topic coherence scores a single topic by combining normalized pointwise mutual information (NPMI) and the cosine similarity between words in the topic [25]. The higher the coherence score, the higher the quality of the generated topics; however, low-quality topics may be composed of highly unrelated words that cannot fit into another topic, leading to a low coherence score [25]. In our corpus, we evaluated topic numbers ranging from one to thirty with 500 passes, and we repeated the experiments five times at each step when generating the topics to avoid random errors in the Cv metric. Figure 10 presents the evaluation results on all the tweets containing the keyword "ChatGPT". In this figure, the horizontal axis indicates the number of topics, the vertical axis indicates the coherence score, and the top and bottom of the shaded band represent the maximum and minimum coherence scores for each number of topics. Since a selected number of topics (k) that is either too big (i.e., k > 30) or too small (i.e., k < 5) makes the topic interpretation problematic, we finally selected 22 topics, which gives the highest coherence score between 5 and 30 topics.
[Fig. 10. Coherence scores of LDA with different numbers of topics]

B. LDA Topics Related to Code Generation on Twitter

Table II illustrates the 17 topics inferred by the LDA model from the filtered tweets related to ChatGPT's code generation. We provide the first 40 words of each topic to demonstrate the most common words. Our analysis shows that ChatGPT has been utilized for various purposes in code generation, including debugging code (topics 9, 13, and 17), testing code and algorithms (topics 5 and 16), preparing for programming interviews (topics 2 and 4), working on programming-related assignments (topics 3 and 6), and other related tasks.

REFERENCES

[1] M. E. Kauffman and M. N. Soares, "AI in legal services: new trends in AI-enabled legal services," Service Oriented Computing and Applications, vol. 14, no. 4, pp. 223–226, 2020.
[2] R. Louie, A. Coenen, C. Z. Huang, M. Terry, and C. J. Cai, "Novice-AI music co-creation via AI-steering tools for deep generative models," in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–13.
[3] S. Barke, M. B. James, and N. Polikarpova, "Grounded Copilot: How programmers interact with code-generating models," 2022. [Online]. Available: https://arxiv.org/abs/2206.15000
[4] P. Vaithilingam, T. Zhang, and E. L. Glassman, "Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models," in Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, ser. CHI EA '22. New York, NY, USA: Association for Computing Machinery, 2022. [Online]. Available: https://doi.org/10.1145/3491101.3519665
[5] M. Aljanabi, M. Ghazi, A. H. Ali, S. A. Abed et al., "ChatGPT: Open possibilities," Iraqi Journal For Computer Science and Mathematics, vol. 4, no. 1, pp. 62–64, 2023.
[6] L. Avila-Chauvet, D. Mejía, and C. O. Acosta Quiroz, "ChatGPT as a support tool for online behavioral task programming," Available at SSRN 4329020, 2023.
[7] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung et al., "A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity," arXiv preprint arXiv:2302.04023, 2023.
[8] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., "CodeBERT: A pre-trained model for programming and natural languages," arXiv preprint arXiv:2002.08155, 2020.
[9] A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan, "IntelliCode Compose: Code generation using Transformer," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1433–1443.
[10] E. Jiang, E. Toh, A. Molina, K. Olson, C. Kayacik, A. Donsbach, C. J. Cai, and M. Terry, "Discovering the syntax and strategies of natural language programming with generative language models," in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–19.
[11] F. F. Xu, B. Vasilescu, and G. Neubig, "In-IDE code generation from natural language: Promise and challenges," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 2, pp. 1–47, 2022.
[12] C. S. Xia and L. Zhang, "Conversational automated program repair," arXiv preprint arXiv:2301.13246, 2023.
[13] D. Sobania, M. Briesch, C. Hanna, and J. Petke, "An analysis of the automatic bug fixing performance of ChatGPT," arXiv preprint arXiv:2301.08653, 2023.
[14] D. Castelvecchi, "Are ChatGPT and AlphaCode going to replace programmers?" Nature, 2022.
[15] J. Chatterjee and N. Dethlefs, "This new conversational AI model can be your friend, philosopher, and guide ... and even your worst enemy," Patterns, vol. 4, no. 1, p. 100676, 2023.
[16] J. Li, Y. Wang, M. R. Lyu, and I. King, "Code completion with neural attention and pointer networks," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, Jul. 2018, pp. 4159–4165. [Online]. Available: https://doi.org/10.24963/ijcai.2018/578
[17] Z. Sun, Q. Zhu, L. Mou, Y. Xiong, G. Li, and L. Zhang, "A grammar-based structural CNN decoder for code generation," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 7055–7062, Jul. 2019. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/4686
[18] V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 419–428. [Online]. Available: https://doi.org/10.1145/2594291.2594321
[19] Z. Sun, Q. Zhu, Y. Xiong, Y. Sun, L. Mou, and L. Zhang, "TreeGen: A tree-based Transformer architecture for code generation," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 8984–8991, Apr. 2020. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/6430
[20] M. Ciniselli, N. Cooper, L. Pascarella, D. Poshyvanyk, M. D. Penta, and G. Bavota, "An empirical study on the usage of BERT models for code completion," CoRR, vol. abs/2103.07115, 2021. [Online]. Available: https://arxiv.org/abs/2103.07115
[21] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn, "The Pushshift Reddit dataset," in Proceedings of the International AAAI Conference on Web and Social Media, vol. 14, 2020, pp. 830–839.
[22] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[23] Y. Feng, Z. Lu, Z. Zheng, P. Sun, W. Zhou, R. Huang, and Q. Cao, "Chasing total solar eclipses on Twitter: Big social data analytics for once-in-a-lifetime events," in 2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019, pp. 1–6.
[24] Y. Feng, D. Zhong, P. Sun, W. Zheng, Q. Cao, X. Luo, and Z. Lu, "Micromobility in smart cities: A closer look at shared dockless e-scooters via big social data," in ICC 2021 - IEEE International Conference on Communications. IEEE, 2021, pp. 1–6.
[25] M. Röder, A. Both, and A. Hinneburg, "Exploring the space of topic coherence measures," in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 2015, pp. 399–408.
[26] J. Ni, J. Li, and J. McAuley, "Justifying recommendations using distantly-labeled reviews and fine-grained aspects," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 188–197.
[27] T. Ziadé and I. Cordasco, "Flake8: Your tool for style guide enforcement," 2021. [Online]. Available: https://flake8.pycqa.org (accessed May 27, 2019).
[28] H. Chen, J. Chen, and H. Nguyen, "Demystifying COVID-19 publications: institutions, journals, concepts, and topics," Journal of the Medical Library Association: JMLA, vol. 109, no. 3, p. 395, 2021.
TABLE II
THE EXTRACTED TOPICS USING THE LDA TOPIC MODEL

Rank Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16 Topic 17
1 capac softwar exam ture languag student haha nah code dey write error que login stabl test pour
2 bro engin school interview program cheat woke delet use song code network con simplifi diffus trade de
3 harder develop mba pass code teacher commit nobodi ask profess use van para rubi companion comput est
4 test job test test test essay broke everybodi write phase gener die una broken til free que
5 medium replac pass candid model assign annoy publicli work evolut ask een por battl ship money pa
6 shit code law flutter use malici partner judg tri glad creat het blender prime plot pay sur
7 unlock googl intellig tree algorithm school dear rout help temp test polit la helper discord softwar avec
8 premium use pa newslett gener educ steroid somebodi program suffer python advent lo flow jest paid le
9 reaction technolog artifici leap human use fli lambda time frequent program messag softwar tabl member use une
10 spin program professor conduct write malwar touch coffe good sad command occur del rap academia servic mai
11 eye tool wharton equival question detect outsourc curios question competitor prompt doom assembl dog leak crypto qui
12 tab think program holiday answer cybersecur citat farm answer semant video met como chain cite version cest
13 sat take stock fail ask kid bisa revolut know pseudo make profession pero odd bare bitcoin par
14 famili new univers extract data write appar recip bug alor websit alpha error skip wife code dan
15 tho search medic duck learn softwar greatest ticket give exponenti app overload python maker wallet price fair
16 write futur grade behav train teach disappoint nowaday learn cryptocurr content persist nut haiku investig sell fait
17 limerick write bar extent think colleg sweet encount fix aux post factual test framework nie cost test

18 holi year busi siri softwar homework parti respect problem yea want yall che conscious ive cloud jai
19 exploit go pose slide understand code untuk bother thing layoff idea android robot star rich algorithm plu
20 accept tech musk appl comput career white straight gener ride tweet neural todo revis laugh buy code
21 skynet stack educ ansibl text academ 2021 interestingli make cave twitter dat toy workout bill employe vou
22 asset peopl softwar obsess respons hacker sir imaginari python shame help subtl inject latex mom make tout
23 alexa amp cat sentient make program pump hahaha debug strang build readabl hacer extend elabor azur lui
24 lost make public doctor way secur vote grab explain exclus script apolog artifici analys unreal power comm
25 weekend test licens studio need plagiar yang irrelev error season tri men dia five visit open bir
26 ask product quantum entiti natur test anybodi peer googl ca work infinit pued watermark graduat compani son
27 drive overflow countri hunt see paper older certif even inner see pleas dune boom staff need mon
28 refresh way invest roll peopl univers threw cook test theori check scam sobr monitor strength program sui
29 aspect world startup examin creat attack third superior find tou new center algo observ cycl amazon bug
30 singular gener stream stress good new screw salari need pipelin text trump outlook associ item user peut
31 comprehens work tesla nation one hack footbal admin solv trop articl crawl crucial brother king token bien
32 casual see score januari tool gener as merg better donner tool whoever per coolest own million quil
33 travel chang firm sec googl threat crime club day ecosystem let kan monkey number coach sourc quand
34 server time lab rubber new puzzl clone showcas one streamlin blog friendli esta santiago gym way san
35 war help till alongsid convers creat bright dream realli nick bot contact muy press ace write gen
36 fuck learn learn premier machin crack woman counter exampl weather copi code tien sandbox ale posit non
37 steal skill lawyer matur base ban manner cup copilot evil thread ook ser linear delight invest bon
38 weird smart final satisfi amp email gui sleep want meanwhil imag net substanti resist god meta cett
39 ile contract world declar design develop numer winner actual mond give mental powerpoint screenshot sparrow go python
40 verbatim compani founder wisdom know concern worst upcom way aris chat insist hay atm boy billion ture
