ChatGPT Coding CompSac 23
keywords that are relevant to programming in the context of ChatGPT, thus expanding the seed keyword of “ChatGPT.” Using these expanded keywords, we retrieve 316K tweets and 3.2K Reddit posts related to ChatGPT’s code generation from December 1, 2022, to January 31, 2023.
Furthermore, we conduct a comprehensive analysis using multimodal data (text and images) to answer the following research questions:
1) What are the most popular programming languages in ChatGPT usage?
2) What programming scenarios, tasks, and purposes are people using ChatGPT for?
3) What is the temporal distribution of the discussion on ChatGPT code generation?
4) How do stakeholders perceive ChatGPT code generation?
5) What are the prompts to generate code?
6) What is the quality of the code generated by ChatGPT?
To the best of our knowledge, the proposed work represents the first systematic study on emerging generative models in code writing and testing. In this paper, we summarize our contributions as follows:
• We have proposed a scalable crowdsourcing and social data-driven framework for investigating the code generation capabilities of ChatGPT.
• We have presented a novel hybrid keyword expansion method that incorporates words recommended by topic modeling and experts to ensure that most of the related social media posts are matched during data collection.
• Our study considers multiple social media platforms (Twitter and Reddit) and multimodal data (text and image) to mitigate potential biases caused by a single data source or data type.
• We have provided data analytics from multiple perspectives, including topic inference, sentiment analysis, and data quality measurement.
• We have built a real-world programming dataset containing the ChatGPT prompt and the associated generated code. This dataset will be released to the entire software engineering community soon.
II. RELATED WORK
As code generation and programming assistant tools, such as CodeBERT [8] and IntelliCode Compose [9], become more widely used, there has been an increased focus on investigating the usability and interactions between users and code generation tools [3], [4], [10], [11]. For example, Barke et al. identified two interaction modes between programmers and code generation tools, acceleration mode and exploration mode, by observing how 20 programmers solved various tasks using the code generation tool Copilot [3]. Although this paper discussed a grounded theory of how programmers might interact differently with Copilot, it only investigated a limited number of users and did not reveal the reactions from users.
Vaithilingam et al. [4] performed a study on 24 participants consisting of different groups of people with minimal and moderate experience in using Copilot and IntelliSense. By quantitative and qualitative analysis, they observed that participants who used Copilot failed to complete tasks more often. Unlike the above case study works, we investigate a large dataset of feedback from users of the code generation tool ChatGPT. Besides the user reactions, our study also examines the performance and limitations of ChatGPT.
As ChatGPT has gained more attention recently, some researchers have studied its use for code generation [5]–[7]. Aljanabi et al. [5] discussed the possibility of using ChatGPT as a code generation tool. Avila et al. [6] have described how ChatGPT’s programming capability can be used for developing online behavioral tasks, such as concurrent reinforcement schedules, in HTML, CSS, and JavaScript code. In their work, they created files with extensions .html, .css, and .js, which include the basic structure of the page, such as headings, linking with styling elements, and other dynamic elements. In contrast to the above works, we analyzed the performance of code generation using ChatGPT.
When it comes to complex coding, there is always a chance of unidentified bugs, which may lead to a code crash. Automated Program Repair (APR) is a concept introduced to provide automatic fixes for detected errors. In recent times, deep learning has enabled APR, and many tools using the concept of Large Language Models (LLMs) with the Transformer technique have been developed. LLMs are giving better results for many code-related tasks, and researchers have started to use them for APR [12]. Tools such as Codex, CodeBERT, and more recently ChatGPT use LLMs for code fixing.
What makes ChatGPT stand out is its ability to discuss the source code with user interaction. Sobania et al. (2023) [13] conducted an experiment to test the efficiency of ChatGPT in bug fixing compared to other tools like Codex and CoCoNut. About 40 of the QuixBugs benchmark problems containing erroneous code were given to ChatGPT to provide solutions. The experiment showed that ChatGPT’s performance is similar to other APR tools like Codex. However, when given more information about the problem through its dialogue box, ChatGPT’s performance improved, with a success rate of 77.5%.
Although ChatGPT works fine with simple code logic, it can be challenging to describe complex needs, such as designing a web browser where user needs should be satisfied, with the simple, machine-readable instructions that ChatGPT or other AI tools use to produce code [14]. Chatterjee and Dethlefs [15] have claimed that code generated by ChatGPT is also gender and race biased, questioning the efficiency of the model.
Previous works and tools for automated code generation mostly relied on using neural networks [16], [17], which cannot match the current performance of GPT-3 models. Raychev et al. [18] proposed a code completion technique using statistical language models to discover highly rated sentences and recommend code completion suggestions. Sun et al. [19] introduced a novel tree-based neural architecture that incorporates grammar rules and AST structures into the network, and it has been shown to have the best accuracy among all neural network-based code generation methods. Ciniselli et al. [20]
conducted a detailed empirical study on BERT models for code completion and evaluated the percentage of perfect predictions that match developer-written code snippets. However, while BERT models offer a potential solution for code completion, their performance is lower compared to LLMs such as Copilot and ChatGPT.
In summary, there is a scarcity of research that delves into the applications of AI-assisted code generation tools. ChatGPT has risen as a prevalent option among these tools. The current study aims to assess the effectiveness of ChatGPT as a code generation tool. To our knowledge, this is the first research that employs a dataset from social media to evaluate the performance of a code generation tool.
III. METHOD
In this section, we present the proposed data-driven generated coding investigation framework by introducing how to collect data of interest, how to analyze data, and how to interpret findings.
A. Framework Overview
The overview of the proposed framework is illustrated in Figure 2. It consists of three key components: Keyword Selection, Data Collection, and Data Analytics. Unlike traditional user study-based research, the crowdsourcing platform is designed to be flexible and scalable, enabling the study of a large population over a long period of time. We will delve into each component in detail, examining the performance of LLMs in code generation.
[Figure 2 diagram residue: three stages (Keyword Expansion, Data Collection, Data Analytics); recoverable labels include Examine Keywords in Coding Topic, Generate Keywords by Experts, Pushshift Reddit API, Reddit Data, Image Data, Image Understanding, Data Processing, and the research questions on temporal distribution, stakeholder perception, prompts, and code quality]
Fig. 2. Overview of the proposed crowdsourcing framework to investigate the programming capabilities of LLMs
B. Keyword Selection for Software Development
To ensure the quality of the collected data, we employ a hybrid approach that combines data-driven keyword expansion and expert-based keyword selection. This approach aims to make the data comprehensive and precise, reducing the risk of bias or incompleteness in the selection of query keywords.
As ChatGPT is one of the most popular LLMs that supports code generation, we use ChatGPT as the seed keyword to sample Twitter streams, harvesting tweets that mention this term. We then perform topic modeling to determine whether a coding-related topic is present. If a coding-related topic is observed, we add the words belonging to this topic to the expanded keyword set. If a coding-related topic is not observed, we conduct a co-occurrence word analysis and calculate the semantic similarity with the word coding to expand the candidate keywords.
However, the data-driven keyword expansion method may result in false positives, i.e., keyword candidates that may not be relevant to AI-based code generation. Therefore, we manually examine all recommended keyword candidates to ensure the quality of the collected data. We first filter out irrelevant keywords and propose multiple combinations of keywords to control the precision of data collection. For example, instead of collecting all postings containing ChatGPT, collecting postings containing both ChatGPT and coding
Specifically, we leveraged Twitter Streaming APIs to sample tweet streams containing the keyword “ChatGPT” for over 55 hours. In total, we collected 158,452 tweets, including original tweets, retweets, and replies. After removing duplicate tweets, we had 63,716 unique tweets. We then applied the LDA model to infer topics based on these unique tweets, with the hope of discovering programming-related topics. We evaluated the number of topics ranging from 1 to 30 and found that the coherence score achieved a relatively high and stable value with the number of topics set as 22. For more details, please see Figure 10. After examining the 22 topics, we identified one of them as “Programming,” consisting of the following words: ask, stack, knew, write, error, diffus, run, python, stabl, scientist, email, straight, shock, gener, comput, command, use, code, notic, brain, bug, statement, think, dead, question, admit, happen, result, and overflow.
Combining the words in the topic of Programming, we come up with the following keyword list to crawl ChatGPT-related code generation posts: algorithm, algorithms, bug, bugs, c#, c++, code, coding, command, commands, compiler, computing, debug, debugging, error, go, interpreter, java, javascript, libraries, php, program, programming, python, r, Ruby, shell, software, sql, stack overflow, swift, test, testing, typescript.
C. Data Collection
Based on the above carefully curated keywords, we leverage two social media platforms, i.e., Twitter and Reddit, to collect data for further analytics.
1) Twitter Data: Instead of relying on Twitter Streaming APIs, we opt to use the Twitter Historical Data Search APIs to create our Twitter dataset for the following reasons: 1) The streaming data is time-sensitive, making it impossible to retrieve older data from the debut of ChatGPT if the streaming data collection had not been launched at that time; 2) If we only investigate the latest data (e.g., after Feb 1, 2023), it will introduce bias as we do not know when the performance of ChatGPT’s code generation was most widely discussed on social media. On the other hand, the historical tweets span the entire evolution of ChatGPT and provide a sample of user comments since its release, enhancing the representativeness and completeness of the crowdsourced opinions and feedback.
Twitter provides two historical data search APIs, i.e., the 30-Day Search API¹ that allows for retrieval of the last 30 days and the Full-archive Search API² that provides tweets since 2006 when the first tweet was posted. Since ChatGPT was released on November 30, 2022, we choose the Full-archive Search API to harvest the data. Specifically, we use Twitter’s Academic Research API, which supports full-archive tweet search, to retrieve ChatGPT-related data from November 30, 2022 to February 1, 2023, with the query configured to only cover English tweets and exclude retweets by setting “-is:retweet lang:en.” In addition to the tweet text, we collect related Twitter media information (e.g., posted images) to support fine-grained analysis. For this study, we collected 316K tweets posted between Dec. 1, 2022, and Jan. 31, 2023.
¹ https://tinyurl.com/2s4xt8r7
2) Reddit Data: Unlike Twitter, where the structure is based on users following one another, Reddit is structured around communities where posts on similar topics are grouped together. These communities are referred to as “subreddits” on Reddit. For instance, the subreddit /r/aww is a community where users share cute and cuddly pictures. The initial posts on Reddit are known as “submissions,” and the responses to these posts are called “comments.”
To assess the performance of ChatGPT in code generation on Reddit, we concentrate on four well-known subreddits, namely /r/ChatGPT, /r/coding, /r/github, and /r/programming. To gather submissions from these subreddits, we use the Search Reddit Submissions Endpoint (/reddit/search/submission) through the Pushshift Reddit API [21]. Similar to Twitter data, multimedia data including images in Reddit submissions are also retrieved. For this study, we collected 3.2K Reddit submissions posted between Dec. 1, 2022, and Jan. 31, 2023, to analyze the code generation performance of ChatGPT.
D. Data Analytics and Pattern Recognition
We primarily employ natural language processing and image understanding techniques to analyze text and image data to uncover insights and identify patterns.
1) Text Based Topic Discovery: To obtain a comprehensive understanding of the use of ChatGPT in code generation on Twitter and Reddit, we employ latent Dirichlet allocation (LDA) [22], a widely used topic modeling technique, to uncover hidden topics in the collected tweets and Reddit submissions. We treat each tweet or submission content as an individual document and the entire collection of tweets or submissions as the corpus of documents. In the text preprocessing stage, we implement commonly used techniques such as removing stop words and frequently occurring words like “ChatGPT,” tokenization, and lemmatization of words. We then perform term frequency-inverse document frequency (TF-IDF) on the combined documents to form a TF-IDF-based corpus, on which latent topics are extracted using LDA models. Following previous research on big social data analysis [23], [24], we select the Cv metric to determine the appropriate number of topics. This metric is known to be one of the best coherence measures as it combines normalized pointwise
Given that Twitter allows users to utilize #hashtags to indicate related topics and enhance visibility through searches, we also present the distribution of #hashtags in the collected tweets. However, as #hashtags are rarely used on Reddit, we do not perform this analysis for Reddit submissions.
2) Image Understanding: As ChatGPT is a text generative model, it is expected that most images related to ChatGPT, particularly those related to code generation, posted on social media will be text-rich. To make these images more informative and easier to process for downstream tasks, we suggest using an Optical Character Recognition (OCR) based approach to convert the collected images into text. We apply multiple OCR methods, including OpenCV-based pytesseract³ and deep learning-based easyOCR⁴, to the collected image dataset. After thoroughly evaluating the OCR detection results, we choose easyOCR to identify and extract text from the images accurately.
³ https://pypi.org/project/pytesseract/
⁴ https://github.com/JaidedAI/EasyOCR
3) Code Reconstruction from Image: To reconstruct the code generated by ChatGPT, it is crucial to identify the images that contain generated code. After examining the screenshots of coding snippets, we found that all ChatGPT-generated code snippets contained the keyword “Copy code” in the top-right corner of the coding block, as shown in Figure 1. Therefore, we selected all images containing the “Copy code” keyword for further analysis.
We proposed two methods to recover the code generated by ChatGPT. The first one is to extract the code directly from the OCR results. We found that it is crucial to address any indentation issues for indentation-sensitive programming languages, such as Python, as a high percentage of errors can occur due to improper indentation. However, automatically indenting any given code can be a complex and challenging task. A simple script that looks for loops and specific statements to increase and decrease the indentation count does not work on all code, especially if the code has multiple indentation styles and conditional statements.
An alternative method to obtain the code is reproducing it using the identical prompt. Specifically, we can identify the prompt and input it into ChatGPT web services⁵ to generate the code. Once we have downloaded the newly produced code, we can assess and evaluate it. In our study, we adopted this reliable method to reconstruct the code generated by ChatGPT.
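The first recovery route in the Code Reconstruction step (filtering OCR output by the “Copy code” marker, then re-indenting flattened code) can be illustrated with a small sketch. This is not the authors' script: the helper names `contains_copy_code` and `naive_reindent` are hypothetical, and the heuristic (indent after a line ending in a colon, dedent before `else`-like keywords) is only an assumed stand-in for the kind of “simple script” the text says is insufficient.

```python
import re

# Keywords that close the previous block before opening a new one.
# This prefix match is a deliberate simplification (an assumption of the sketch).
DEDENT_FIRST = ("elif ", "else:", "except ", "except:", "finally:")


def contains_copy_code(ocr_text: str) -> bool:
    """Return True if OCR output contains the 'Copy code' marker.

    The match ignores case and tolerates missing or extra whitespace,
    since OCR output is often noisy.
    """
    return re.search(r"copy\s*code", ocr_text, flags=re.IGNORECASE) is not None


def naive_reindent(flat_lines, indent="    "):
    """Re-indent Python lines whose leading whitespace was lost by OCR.

    Heuristic only: indent after a line ending with ':' and dedent before
    else/elif/except/finally. It has no way to know where a block ends.
    """
    level = 0
    out = []
    for raw in flat_lines:
        line = raw.strip()
        if not line:
            out.append("")
            continue
        if line.startswith(DEDENT_FIRST):
            level = max(level - 1, 0)
        out.append(indent * level + line)
        if line.endswith(":"):
            level += 1
    return out


if __name__ == "__main__":
    # The if/else case is recovered correctly.
    print("\n".join(naive_reindent(["if x:", "a = 1", "else:", "a = 2"])))
    # Here the final print is wrongly kept inside the loop, because the
    # flattened text carries no signal that the loop body has ended.
    print("\n".join(naive_reindent(["for i in range(3):", "print(i)", "print('done')"])))
```

The second call shows why such a heuristic cannot be trusted in general: once indentation is lost, the end of a block is ambiguous, which is consistent with the decision to regenerate code from the prompt instead.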
4) Sentiment Analysis: Considering that ChatGPT may trigger diverse emotions in code generation, we do not think the three categories of positive, negative, and neutral can cover all involved emotions. In order to accurately reflect the various and complex emotions expressed in social media users’ comments, we choose to categorize them into more inclusive emotions: Happy, Angry, Sad, Surprise, and Fear. To achieve this, we utilize Text2Emotion [26], a Python package capable of analyzing sentiments and categorizing them into the aforementioned five emotions.
5) Code Quality Evaluation: To assess the quality of the code generated by ChatGPT, we utilize Flake8 [27], which is a wrapper around PyFlakes, pycodestyle, and Ned Batchelder’s McCabe script. Flake8 runs all of these tools with a single command and assigns a unique code to each type of error or warning. The output of warnings and errors is displayed per file. We choose Flake8 as our evaluation tool because it is one of the most powerful and flexible tools available, providing a wide range of error codes while remaining fast to run checks. Flake8 is particularly effective when checking for correctness and whitespace issues, making it an ideal choice for our purposes.
IV. EVALUATION AND FINDINGS
B. Topics Related to Code Generation
We generated topics for the tweets containing the keyword “ChatGPT” and programming-related words using the LDA model. Based on the coherence score presented in Figure 4, we finally select 17 topics. The 17 topics and the word list of each topic are presented in Table II (see Appendix B). The topic modeling results indicate that ChatGPT has been used for different purposes regarding code generation, such as debugging code (topic 9, topic 13, topic 17), testing code/algorithms (topic 5, topic 16), preparing for programming interviews (topic 2 and topic 4), working on programming-related assignments (topic 3, topic 6), and other related tasks. Twitter users also notice that ChatGPT’s capacity in code generation is limited (topic 1). Still, the ethical issues and social responsibility aspects of ChatGPT have not been discussed much among users.
[Figure 4 plot residue: y-axis “Coherence score Cv”, ticks from 0.30 to 0.55]
[Figure residue, hashtag distribution; axis labels: #ArtificialIntelligence, #Technology, #AI, #OPENAI, #Programming, #Python, #Coding, #ChatBOT, #GPT3, #ChatGPT3, #TECH, #CODE, #MachineLearning, #GOOGLE, #ML, #WEB3, #DATASCIENCE, #JavaScript, #Crypto, #SEO, #Innovation, #MICROSOFT, #OpenAIChatGPT, #CyberSecurity, #100DaysOfCode, #Developer, #Software, #GenerativeAI, #BITCOIN]
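A hashtag distribution like the one labeled above could be computed with a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `hashtag_distribution` is hypothetical, hashtags are case-folded so that variants such as #AI and #ai merge, and each hashtag is counted at most once per tweet.

```python
import re
from collections import Counter

# A hashtag is '#' followed by word characters (letters, digits, underscore).
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_distribution(tweets, top_n=30):
    """Return the top_n (hashtag, tweet_count) pairs over an iterable of tweet texts.

    Assumptions of this sketch: tags are lower-cased before counting, and a
    tag repeated inside one tweet contributes a single count.
    """
    counts = Counter()
    for text in tweets:
        counts.update({tag.lower() for tag in HASHTAG_RE.findall(text)})
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Made-up example tweets, for illustration only.
    sample = [
        "#ChatGPT wrote my #Python script",
        "#chatgpt and #AI again #AI",
        "no tags here",
    ]
    for tag, n in hashtag_distribution(sample):
        print(f"#{tag}: {n}")
```

Whether to merge case variants is a design choice; keeping the original casing would instead treat #TECH and #tech as separate labels.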
[Figure 3 plot residue: axis labels c#, c++, java, php, javascript, python, r, ruby, shell, sql, swift, typescript]
Fig. 3. Programming language distribution
C. Temporal Distribution
Temporal analysis can be used to examine the popularity over time. Figure 6 visualizes the daily distribution of posts on Twitter (blue) and Reddit (yellow) related to ChatGPT’s code generation in the first two months after its launch. ChatGPT discussion spread faster on Twitter than on Reddit. We observe a peak in the discussion of ChatGPT code generation on Twitter and Reddit at the end of the first week after the release of ChatGPT. The popularity decreased from the second week but remained high on both platforms. Even after two months, the attention on ChatGPT is still stable, indicating ChatGPT is helpful for code generation.
TABLE I
CODE QUALITY RESULTS BY FLAKE8
Code  Description                                           Percentage
E501  line too long (114 > 79 characters)                   29.39%
E231  missing whitespace after ’,’                          15.54%
E302  expected 2 blank lines, found 1                       13.51%
W293  blank line contains whitespace                        10.14%
E402  module level import not at top of file                5.41%
E305  expected 2 blank lines after class, found 1           4.73%
E265  block comment should start with ’# ’                  4.39%
W291  trailing whitespace                                   2.70%
E999  SyntaxError: invalid syntax                           2.36%
E227  missing whitespace around bitwise or shift operator   1.69%
[Figure 6 plot residue: y-axis PDF(%); series Twitter, Reddit]
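A per-code percentage breakdown like Table I could be tallied from Flake8's text report. The sketch below is illustrative only: it assumes Flake8's default `path:row:col: CODE message` line format, it assumes the percentages are shares of all reported violations (the text does not say how they were computed), and the function name `code_percentages` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Default Flake8 report line: "<path>:<row>:<col>: <CODE> <message>"
LINE_RE = re.compile(r"^(?P<path>.+?):(?P<row>\d+):(?P<col>\d+): (?P<code>[A-Z]+\d+) ")


def code_percentages(report_lines):
    """Map each Flake8 code to its share (in percent) of all parsed violations.

    Lines that do not match the default report format are ignored.
    Results are ordered from most to least frequent code.
    """
    counts = Counter()
    for line in report_lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("code")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: round(100 * n / total, 2) for code, n in counts.most_common()}


if __name__ == "__main__":
    # Made-up report lines, for illustration only.
    report = [
        "a.py:1:80: E501 line too long (114 > 79 characters)",
        "a.py:2:5: E231 missing whitespace after ','",
        "b.py:1:80: E501 line too long (90 > 79 characters)",
        "b.py:3:1: W291 trailing whitespace",
    ]
    for code, pct in code_percentages(report).items():
        print(f"{code}  {pct:.2f}%")
```

Parsing the saved report rather than calling Flake8 from Python keeps the sketch independent of any particular Flake8 version.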
[Figure 7 plot residue: eight panels; visible axis labels Happy, Fear, Angry; series Twitter, Reddit]
Fig. 7. Sentiment analysis results on code generation for eight programming languages
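Per-group emotion profiles of the kind plotted in Fig. 7 could be obtained by averaging per-post emotion scores. The sketch below is an assumption-laden illustration rather than the authors' code: it assumes each post has already been scored with a dictionary over the five emotions named in the paper (Happy, Angry, Surprise, Sad, Fear), and the function name `mean_emotions` and the grouping key (for example a platform name, or a platform and language pair) are hypothetical.

```python
from collections import defaultdict

EMOTIONS = ("Happy", "Angry", "Surprise", "Sad", "Fear")


def mean_emotions(scored_posts):
    """Average emotion scores per group.

    scored_posts: iterable of (group, scores) pairs, where `group` is any
    hashable key and `scores` maps emotion name -> float. Emotions missing
    from a score dictionary are treated as 0.0.
    """
    sums = defaultdict(lambda: dict.fromkeys(EMOTIONS, 0.0))
    counts = defaultdict(int)
    for group, scores in scored_posts:
        for emotion in EMOTIONS:
            sums[group][emotion] += scores.get(emotion, 0.0)
        counts[group] += 1
    return {
        group: {emotion: sums[group][emotion] / counts[group] for emotion in EMOTIONS}
        for group in sums
    }


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    posts = [
        ("Twitter", {"Happy": 0.5, "Fear": 0.5}),
        ("Twitter", {"Happy": 1.0}),
        ("Reddit", {"Sad": 1.0}),
    ]
    for group, profile in mean_emotions(posts).items():
        print(group, profile)
```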
Fig. 8. WordCloud of prompts
Fig. 9. A sample in the public dataset of prompts (Line 3) and generated code (Line 4 - Line 17)
[10] E. Jiang, E. Toh, A. Molina, K. Olson, C. Kayacik, A. Donsbach, C. J. Cai, and M. Terry, “Discovering the syntax and strategies of natural language programming with generative language models,” in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–19.
[11] F. F. Xu, B. Vasilescu, and G. Neubig, “In-IDE code generation from natural language: Promise and challenges,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 2, pp. 1–47, 2022.
score with the number of topics set differently. Since a number of topics (k) that is either too large (i.e., k > 30) or too small (i.e., k < 5) makes topic interpretation problematic, we finally selected 22 topics, which gave the highest coherence score for k between 5 and 30.
[Figure plot residue: y-axis “Coherence score Cv”]
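The selection rule described above (take the number of topics with the highest coherence score, restricted to k between 5 and 30) can be written down in a few lines. This is a minimal sketch: the function name `pick_num_topics` is hypothetical, the tie-break toward the smaller k is an assumption, and the scores in the usage example are invented for illustration and are not the paper's measurements.

```python
def pick_num_topics(coherence_by_k, k_min=5, k_max=30):
    """Return the k in [k_min, k_max] with the highest coherence score.

    coherence_by_k maps a number of topics k to its coherence score.
    Ties are broken in favor of the smaller k (an assumption of this sketch).
    """
    candidates = {k: s for k, s in coherence_by_k.items() if k_min <= k <= k_max}
    if not candidates:
        raise ValueError("no coherence scores inside the allowed range of k")
    return max(candidates, key=lambda k: (candidates[k], -k))


if __name__ == "__main__":
    # Invented scores: k=3 and k=40 score highest but fall outside the range.
    scores = {3: 0.60, 10: 0.41, 22: 0.48, 30: 0.45, 40: 0.70}
    print(pick_num_topics(scores))
```

Restricting the range before taking the maximum is what keeps very small or very large k from winning on score alone.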
TABLE II
THE EXTRACTED TOPICS USING THE LDA TOPIC MODEL
Rank Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16 Topic 17
1 capac softwar exam ture languag student haha nah code dey write error que login stabl test pour
2 bro engin school interview program cheat woke delet use song code network con simplifi diffus trade de
3 harder develop mba pass code teacher commit nobodi ask profess use van para rubi companion comput est
4 test job test test test essay broke everybodi write phase gener die una broken til free que
5 medium replac pass candid model assign annoy publicli work evolut ask een por battl ship money pa
6 shit code law flutter use malici partner judg tri glad creat het blender prime plot pay sur
7 unlock googl intellig tree algorithm school dear rout help temp test polit la helper discord softwar avec
8 premium use pa newslett gener educ steroid somebodi program suffer python advent lo flow jest paid le
9 reaction technolog artifici leap human use fli lambda time frequent program messag softwar tabl member use une
10 spin program professor conduct write malwar touch coffe good sad command occur del rap academia servic mai
11 eye tool wharton equival question detect outsourc curios question competitor prompt doom assembl dog leak crypto qui
12 tab think program holiday answer cybersecur citat farm answer semant video met como chain cite version cest
13 sat take stock fail ask kid bisa revolut know pseudo make profession pero odd bare bitcoin par
14 famili new univers extract data write appar recip bug alor websit alpha error skip wife code dan
15 tho search medic duck learn softwar greatest ticket give exponenti app overload python maker wallet price fair
16 write futur grade behav train teach disappoint nowaday learn cryptocurr content persist nut haiku investig sell fait
17 limerick write bar extent think colleg sweet encount fix aux post factual test framework nie cost test
18 holi year busi siri softwar homework parti respect problem yea want yall che conscious ive cloud jai
19 exploit go pose slide understand code untuk bother thing layoff idea android robot star rich algorithm plu
20 accept tech musk appl comput career white straight gener ride tweet neural todo revis laugh buy code
21 skynet stack educ ansibl text academ 2021 interestingli make cave twitter dat toy workout bill employe vou
22 asset peopl softwar obsess respons hacker sir imaginari python shame help subtl inject latex mom make tout
23 alexa amp cat sentient make program pump hahaha debug strang build readabl hacer extend elabor azur lui
24 lost make public doctor way secur vote grab explain exclus script apolog artifici analys unreal power comm
25 weekend test licens studio need plagiar yang irrelev error season tri men dia five visit open bir
26 ask product quantum entiti natur test anybodi peer googl ca work infinit pued watermark graduat compani son
27 drive overflow countri hunt see paper older certif even inner see pleas dune boom staff need mon
28 refresh way invest roll peopl univers threw cook test theori check scam sobr monitor strength program sui
29 aspect world startup examin creat attack third superior find tou new center algo observ cycl amazon bug
30 singular gener stream stress good new screw salari need pipelin text trump outlook associ item user peut
31 comprehens work tesla nation one hack footbal admin solv trop articl crawl crucial brother king token bien
32 casual see score januari tool gener as merg better donner tool whoever per coolest own million quil
33 travel chang firm sec googl threat crime club day ecosystem let kan monkey number coach sourc quand
34 server time lab rubber new puzzl clone showcas one streamlin blog friendli esta santiago gym way san
35 war help till alongsid convers creat bright dream realli nick bot contact muy press ace write gen
36 fuck learn learn premier machin crack woman counter exampl weather copi code tien sandbox ale posit non
37 steal skill lawyer matur base ban manner cup copilot evil thread ook ser linear delight invest bon
38 weird smart final satisfi amp email gui sleep want meanwhil imag net substanti resist god meta cett
39 ile contract world declar design develop numer winner actual mond give mental powerpoint screenshot sparrow go python
40 verbatim compani founder wisdom know concern worst upcom way aris chat insist hay atm boy billion ture