ABSTRACT

Recent language models have demonstrated proficiency in summarizing source code. However, as in many other domains of machine learning, language models of code lack sufficient explainability — informally, we lack a formulaic or intuitive understanding of what and how models learn from code. Explainability of language models can be partially provided if, as the models learn to produce higher-quality code summaries, they also align in deeming the same code parts important as those identified by human programmers. In this paper, we report negative results from our investigation of explainability of language models in code summarization through the lens of human comprehension. We measure human focus on code using eye-tracking metrics such as fixation counts and duration in code summarization tasks. To approximate language model focus, we employ a state-of-the-art model-agnostic, black-box, perturbation-based approach, SHAP (SHapley Additive exPlanations), to identify which code tokens influence the generation of summaries. Using these settings, we find no statistically significant relationship between language models' focus and human programmers' attention. Furthermore, alignment between model and human foci in this setting does not seem to dictate the quality of the LLM-generated summaries. Our study highlights an inability to align human focus with SHAP-based model focus measures. This result calls for future investigation of multiple open questions for explainable language models for code summarization and software engineering tasks in general, including the training mechanisms of language models for code, whether there is an alignment between human and model attention on code, whether human attention can improve the development of language models, and what other model focus measures are appropriate for improving explainability.

CCS CONCEPTS

• Computing methodologies → Artificial intelligence.

KEYWORDS

Neural Code Summarization, Language Models, Explainable AI, SHAP, Human Attention, Eye-Tracking

ACM Reference Format:
Jiliang Li, Yifan Zhang, Zachary Karas, Collin McMillan, Kevin Leach, and Yu Huang. 2024. Do Machines and Humans Focus on Similar Code? Exploring Explainability of Large Language Models in Code Summarization. In 32nd IEEE/ACM International Conference on Program Comprehension (ICPC '24), April 15–16, 2024, Lisbon, Portugal. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3643916.3644434

1 INTRODUCTION

Recent language models for code have shown promising performance on several code-related tasks [31]. Among these tasks is neural code summarization, where a language model generates a short natural language summary describing a given code snippet. This task is often used as an indicator of a model's ability to comprehend code. Currently, the majority of assessments of how well a language model understands code directly measure the quality of code summaries generated by the models and compare them with human-written summaries [31]. Comparatively little is known about why and how the language models reason about code to generate such summaries. Similar to many other downstream domains of machine learning in software engineering, understanding and explaining how and why language models for code work (or fail) is critical to improving model architecture, reducing bias, and preventing undesirable model behavior.

Human programmers typically achieve a strong understanding of code. Thus, proficient language models might be explained if they focus on the same parts of code that humans would [21]. Eye-tracking studies have been conducted to analyze programmers' visual patterns while reading code [2, 22]. Specifically, the duration and frequency of a programmer's eye gaze on a part of code in a spatially-stable manner, referred to as fixation duration and fixation count respectively, are indicative of cognitive load [24]. Thus, these measures of eye-tracking can indicate the parts of code on which human programmers focus. In contrast, there is a lack of consensus on how to measure a language model's reasoning about code (see Section 2.2). Most existing works extract the self-attention layers in language models for code to measure the model attention [21, 20, 9].
Such methods require direct access to the internal layers of a language model, limiting the possibility of investigating the interpretability of many state-of-the-art proprietary models (e.g., ChatGPT).

In this paper, to investigate how proprietary language models reason about code, we employ a state-of-the-art perturbation-based method, SHAP [15] (SHapley Additive exPlanations), that treats each language model as a black-box function. With SHAP, we analyze the feature attribution (i.e., which parts of code are deemed important by the model) in six different state-of-the-art language models for code. We use a set of Java methods to task both the language models and human programmers with writing code summaries. The feature attribution in the language models, measured by SHAP, is then compared with human developers' focus, collected from eye-tracking. We hypothesize that sufficiently large models may learn to focus on parts of code similarly to humans. If validated, language model behavior can thus be described in terms of human behavior, ultimately helping to explain and improve language models. However, we find that explainability cannot be provided through this lens and find no statistically significant evidence suggesting the hypothesized alignment. Furthermore, we did not find that language models' focus exhibits a statistically significant correlation with human focus in general. For future research that aims to explore the explainability of language models for code summarization, especially for those leveraging human attention, our findings might suggest the following: (1) though widely used in AI, SHAP may not be an optimal method to investigate where language models focus during code summarization, or alternatively, (2) a misalignment between language models and human developers in reasoning about code may provide insights for improving AI models for code summarization.

2 BACKGROUND AND RELATED WORK

2.1 Neural Models for Code Summarization

Advancements in deep learning have enabled machine learning models to generate summaries for source code. Among the state-of-the-art models, NeuralCodeSum (NCS) first introduced the use of Transformers in neural code summarization [3]. With the rise of large language models (LLMs), ServiceNow and HuggingFace released a 15.5B parameter LLM for code, StarCoder [13], and Meta released a 7B parameter LLM, Code Llama [23], both of which can serve to summarize code. Although not inherently LLMs for code, GPT3.5 [18] and GPT4 [19] are also capable of code summarization. In this paper, we investigate how all the aforementioned models reason about code when tasked to generate code summaries.

2.2 Interpretability of Language Models

Existing works on interpretable language models generally seek to investigate the relative importance of each input token for model performance [29, 8, 16]. Such works can be commonly categorized into two types: white-box vs. black-box. White-box approaches require access to a language model's internal layers [25, 28], often directly investigating the self-attention scores in Transformer-based models [7, 33, 32]. However, Transformer-based models' inherent complexity has led to a lack of consensus on how to aggregate attention weights [30, 35, 33]. For the general research community, white-box approaches preclude proprietary models (e.g., ChatGPT). In contrast, state-of-the-art black-box approaches like SHAP [15] (SHapley Additive exPlanations) apply game-theoretic principles to assess the impact of input variations on a model's output. SHAP evaluates the effects of different combinations of input features — such as tokens in a text sequence — by observing how their presence or absence (simulated by token masking) alters the model's prediction from an expected result. This process helps to ascertain the relative contribution of each feature to the output, allowing for an analysis of the model without requiring access to its internal architecture [14, 11]. In this paper, to investigate proprietary models, we employ SHAP to measure where language models focus on code.
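As an illustration of this black-box usage, the sketch below obtains a per-(input token, output token) attribution matrix from the shap package for a publicly available sequence-to-sequence code summarizer. The checkpoint name is purely illustrative (it is not one of the models studied in this paper), and the exact explainer interface may vary across shap versions.

    # Minimal sketch: black-box SHAP attribution for a code summarizer.
    # The HuggingFace checkpoint below is an illustrative stand-in, not a model
    # evaluated in this paper.
    import shap
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    checkpoint = "Salesforce/codet5-base-multi-sum"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    java_method = "public int add(int a, int b) { return a + b; }"

    # shap masks subsets of input tokens and observes how the likelihood of each
    # generated output token changes, yielding a game-theoretic attribution.
    explainer = shap.Explainer(model, tokenizer)
    explanation = explainer([java_method])

    # explanation.values[0] is roughly an (input tokens x output tokens) matrix of
    # importance scores; explanation.data[0] holds the corresponding input tokens.
    print(explanation.data[0])
    print(explanation.values[0].shape)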
2.3 Comparing Human vs. Machine Attention

Previous papers have examined the alignment between human and model attention in code comprehension tasks. Paltenghi et al. [20] found that CodeGen's [17] self-attention layers attend to similar parts of code compared to human programmers' visual fixations when answering comprehension questions about code. Similarly, Huber et al. [9] discovered overlaps in attention patterns between neural models and humans when repairing buggy programs. Notably, Paltenghi and Pradel [21] compared language models' self-attention weights and humans' visual attention during code summarization. They found that model attention, measured by self-attention weights, does not align well with human attention. However, this work is limited by investigating only small CNN and transformer models. Most importantly, all aforementioned studies used white-box approaches towards interpretability of open-source models, limiting applicability to state-of-the-art proprietary models. Recently, Kou et al. [11] utilized both white-box and black-box perturbation-based approaches to measure LLMs' focus in code generation tasks, and discovered a consistent misalignment with humans' attention. In general, these works have demonstrated that whether human and machine attention align depends heavily on the methods employed to approximate machine focus, as well as the specific code comprehension task examined. In this paper, we build upon former works by examining whether human attention correlates with feature attribution in language models, measured by a black-box perturbation-based approach, in code summarization.

3 EXPERIMENTAL DESIGN

3.1 Measuring Human Visual Focus

We used eye-tracking data measuring human attention from a controlled human study with 27 programmers. The study obtained IRB approval, and asked participants to read Java methods and write accompanying summaries [4]. Each participant summarized 24–25 Java methods from the FunCom dataset [12], yielding 671 trials of eye-tracking data in total. Considering data quality, two authors with five and eight years of Java experience cooperatively removed participant data associated with five summaries that did not demonstrate an understanding of the Java code.

In this work, we sought to measure where humans and language models focus on code as they summarize it. We first used the srcML parser to convert each Java method into its corresponding Abstract Syntax Tree (AST) representation [6]. The AST provides structural context for each token literal (i.e., ‘Hello World’ → String Literal). With the gaze coordinates collected from the eye-tracker [1], we measured humans' focus on each AST token. Typically, researchers use fixations to quantify human visual focus [24]. A fixation is defined as a spatially stable eye-movement lasting 100–300ms. Most cognitive processing occurs during fixations [24], so researchers consider their frequency and duration in making inferences about human cognition. In our analyses, we computed the average count and duration of programmers' fixations on each AST token. Consequently, for each Java method, we obtained two visual focus vectors, each with length equal to the number of AST tokens, which respectively represent fixation counts and durations on each token. (Our analyses do not include brackets, semi-colons, or other such syntactic elements.)
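To make this aggregation concrete, the sketch below turns per-fixation records into the two visual focus vectors for one Java method. The record layout (participant id, AST token index, fixation duration) is a hypothetical simplification, not the exact format of the eye-tracking data used in the study.

    # Minimal sketch: aggregate fixations into per-AST-token focus vectors.
    # The fixation record layout here is hypothetical.
    from collections import defaultdict

    def human_focus_vectors(fixations, num_ast_tokens, num_participants):
        """fixations: iterable of (participant_id, ast_token_index, duration_ms)."""
        counts = defaultdict(int)
        durations = defaultdict(float)
        for _participant, token_idx, duration_ms in fixations:
            counts[token_idx] += 1
            durations[token_idx] += duration_ms

        # Average over participants so each AST token gets a mean count and duration.
        count_vec = [counts[i] / num_participants for i in range(num_ast_tokens)]
        duration_vec = [durations[i] / num_participants for i in range(num_ast_tokens)]
        return count_vec, duration_vec

Normalization of these vectors so that each sums to 1 is described in Section 3.3.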
3.2 Measuring Model Focus

As mentioned in Section 2.2, we choose SHAP's official, default implementation of the TeacherForcing method to measure feature attribution in language models, treating each as a black-box function. For each language model, we pass in each of the 68 Java methods (also read by human programmers) as input, along with necessary prompting for the model to output summaries of source code. For each Java method passed into each language model, we let i denote an input token (in code) and o denote an output token (in summary). For each (i, o) pair, SHAP produces an importance score, denoted v_(i,o), signifying how much i's presence or absence alters the presence of o. Then, the importance score of each input token, v_i, is calculated such that v_i = Σ_o |v_(i,o)|. (We use the absolute value by choice; without it, experiments show similar results.) Note that v_i is associated with a language model token, and each AST token may consist of several language model tokens. Thus, for each AST token, we calculate its score as (1/n) Σ_{j=1}^{n} v_j, where v_1, ..., v_n are the scores of the language model tokens constituting the AST token. Consequently, for each language model on each Java method, we obtain a focus vector (with a length equal to the number of AST tokens) representing how influential each AST token is to the model.

In total, we investigated the model focus of six different models: GPT4, GPT-few-shot, GPT3.5, StarCoder, Code Llama, and NCS. Here, GPT-few-shot is a GPT3.5 model, but in an attempt to make the model produce code summaries more similar to those written by humans, we used few-shot prompting to instruct the model to provide summaries similar to two randomly selected human-written summaries. The other five state-of-the-art LLMs are introduced in Section 2.1 and implemented with their default parameters.
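For concreteness, the following numpy sketch performs the SHAP aggregation described at the beginning of this section, starting from a per-(input token, output token) SHAP matrix; the mapping from language model (subword) tokens to AST tokens is assumed to be available.

    # Minimal sketch: collapse SHAP values into an AST-token focus vector,
    # following v_i = sum_o |v_(i,o)| and per-AST-token averaging of subword scores.
    import numpy as np

    def ast_focus_vector(shap_matrix, ast_groups):
        """
        shap_matrix: array of shape (n_model_tokens, n_output_tokens) holding the
                     SHAP value v_(i,o) for input token i and output token o.
        ast_groups:  list of lists; ast_groups[k] contains the indices of the
                     language model tokens that constitute AST token k.
        """
        # Importance of each language model token: sum of |v_(i,o)| over outputs.
        token_scores = np.abs(shap_matrix).sum(axis=1)

        # Each AST token's score is the mean of its constituent model-token scores.
        focus = np.array([token_scores[indices].mean() for indices in ast_groups])

        # Normalize to sum to 1 (Section 3.3) so vectors are comparable across methods.
        total = focus.sum()
        return focus / total if total > 0 else focus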
3.3 Comparing Human and Model Foci

For brevity, we refer to the two human visual focus measurements (i.e., fixation duration and count) and the six language models as eight "focus sources." For each source, we obtained 68 focus vectors, each corresponding to a Java method. These vectors were normalized to sum to 1, and reflect how important each AST token is for the human/model. We answer these research questions:

• RQ1: Is there a general correlation between human and machine focus patterns for code summarization?
• RQ2: Do the code summaries increase in quality when machine focus becomes more aligned with that of humans?

3.3.1 RQ1. We assess the correlation between human and machine foci across the 68 Java methods. Specifically, for each pair of focus sources, we iterate through each Java method and calculate the Spearman's rank coefficient (ρ) [27] between the two sources' vectors for that method. Then, for each pair of focus sources, we report: (1) the mean and standard deviation of Spearman's ρ across all Java methods where the correlation is statistically significant (p ≤ 0.05), and (2) the proportion of Java methods demonstrating a statistically significant correlation (p ≤ 0.05).

In addition, we group all AST tokens into 18 semantic categories (e.g., method call, operator, etc.) and investigate how much humans and language models focus on each semantic category. (We use fixation durations to represent human focus; we empirically verify that using fixation count yields similar results.) The focus score assigned to each semantic category is the sum of the focus scores assigned to each AST token belonging to that semantic category. To counter biases where certain semantic categories contain more AST tokens or appear more frequently, we report the relative difference between machine and human foci for each semantic category. That is, we average the six language models' focus scores per category and report |(foc_machine − foc_human) / foc_human|.
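A minimal sketch of this per-method analysis, assuming the 68 focus vectors of each source are stored in dictionaries keyed by Java method id, is shown below.

    # Minimal sketch: RQ1 statistics for one pair of focus sources.
    import numpy as np
    from scipy.stats import spearmanr

    def compare_focus_sources(source_a, source_b, alpha=0.05):
        """source_a, source_b: dicts mapping a Java method id to a focus vector."""
        significant_rhos = []
        for method_id in source_a:
            rho, p = spearmanr(source_a[method_id], source_b[method_id])
            if p <= alpha:
                significant_rhos.append(rho)

        proportion_significant = len(significant_rhos) / len(source_a)
        mean_rho = float(np.mean(significant_rhos)) if significant_rhos else float("nan")
        std_rho = float(np.std(significant_rhos)) if significant_rhos else float("nan")
        return mean_rho, std_rho, proportion_significant

    def relative_category_difference(foc_machine, foc_human):
        """Per-category relative difference |(machine - human) / human| (cf. Figure 1)."""
        return {c: abs((foc_machine[c] - foc_human[c]) / foc_human[c]) for c in foc_human}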
3.3.2 RQ2. Here, a human expert provides quality ratings for summaries generated by each language model for every Java method using four criteria: accuracy, completeness, conciseness, and readability. Next, we calculate the Spearman's ρ between each language model's focus vector and the human fixation duration vector across all Java methods where the correlation is significant (p ≤ 0.05). We then append all such statistically significant ρ's to form a vector, denoted v_cor, to represent the degrees of alignment between machine and human foci across the Java methods investigated. (Note that v_cor contains Spearman's ρ's obtained from all six language models; we empirically verify that conducting the analogous analysis for each language model separately yields a similar result.)

Subsequently, we determine whether this alignment is correlated with the rated quality of summaries. Specifically, we construct four other vectors, {v_acc, v_com, v_con, v_rea}, containing the accuracy, completeness, conciseness, and readability scores respectively. At each index i, {v_cor[i], v_acc[i], v_com[i], v_con[i], v_rea[i]} are respectively the Spearman's ρ, summary accuracy, completeness, conciseness, and readability of the same language model applied on the same Java method. We then measure and report the Spearman's rank correlation between v_cor and v_i, where v_i ∈ {v_acc, v_com, v_con, v_rea}.
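Under the same (assumed) data layout, the RQ2 analysis reduces to one more rank correlation between the alignment vector v_cor and each quality vector, as sketched below.

    # Minimal sketch: correlate human-machine focus alignment with summary quality.
    from scipy.stats import spearmanr

    def alignment_vs_quality(records, alpha=0.05):
        """
        records: one dict per (language model, Java method) pair with keys 'rho'
                 (focus alignment with human fixation duration), 'p' (its p-value),
                 and the expert ratings 'accuracy', 'completeness', 'conciseness',
                 and 'readability'.
        """
        # Keep only pairs whose focus alignment is statistically significant (v_cor).
        kept = [r for r in records if r["p"] <= alpha]
        v_cor = [r["rho"] for r in kept]

        results = {}
        for metric in ("accuracy", "completeness", "conciseness", "readability"):
            v_metric = [r[metric] for r in kept]
            # Spearman's rank correlation between alignment and this quality metric.
            results[metric] = spearmanr(v_cor, v_metric)
        return results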
4 RESULTS

4.1 RQ1: General Correlation

Table 1: Pair-wise correlation among focus sources; "Duration" and "Count" are human visual focus. Each cell shows the means and standard deviations of Spearman's ρ for all Java methods showing significant correlation (p ≤ 0.05).

                Duration    Count       GPT4        GPT-few     GPT3.5      StarCoder   Code Llama  NCS
    Duration    1.00±0.00   0.88±0.06   -0.11±0.41  -0.13±0.42  -0.09±0.52  -0.18±0.48  -0.18±0.42  -0.24±0.40
    Count       —           1.00±0.00    0.01±0.45  -0.24±0.33  -0.10±0.48  -0.31±0.29  -0.13±0.43  -0.33±0.33
    GPT4        —           —            1.00±0.00   0.68±0.12   0.76±0.12   0.67±0.14   0.67±0.14   0.55±0.13
    GPT-few     —           —            —           1.00±0.00   0.72±0.12   0.62±0.15   0.64±0.15   0.55±0.13
    GPT3.5      —           —            —           —           1.00±0.00   0.65±0.16   0.67±0.15   0.58±0.13
    StarCoder   —           —            —           —           —           1.00±0.00   0.66±0.15   0.59±0.11
    Code Llama  —           —            —           —           —           —           1.00±0.00   0.56±0.14
    NCS         —           —            —           —           —           —           —           1.00±0.00

As shown in Table 1, there is a general lack of correlation between human and machine foci. We highlight that the means and standard deviations in Table 1 are only calculated from Java methods where the Spearman's ρ is statistically significant (with p ≤ 0.05). In practice, between any pair of human-LLM focus sources, at most 22% of the 68 Java methods yield a Spearman's ρ with p ≤ 0.05. As a baseline, the Spearman's ρ has p ≤ 0.05 for all Java methods between the human fixation duration and fixation count vectors, and for 85% of Java methods between any two language models' focus vectors. This implies that any existing correlation between human and machine foci is not widespread across the Java methods studied.

Furthermore, among those Java methods where the correlation is statistically significant, the mean Spearman's ρ is small with a large standard deviation. In fact, for most such methods where a model and human show significant correlation in focus, the Spearman's ρ is often either around 0.5 or −0.5, but rarely in between. This suggests the relationship between human and machine foci varies significantly depending on the specific Java method.

Interestingly, although few-shot alignment in GPT-few-shot renders the model's generated summaries more similar to those of humans, this does not lead to higher correlations between model and human foci. In addition, feature attribution in all language models is moderately or strongly positively correlated with each other on a majority of Java methods, which intuitively makes sense since all six models are based on the Transformer architecture.

We also investigate how language models' focus on each semantic category differs from that of humans. As shown in Figure 1, language models' generation of code summaries seems to be more reliant on comments, return values, and specific statements such as literals and assignments, and less reliant on method calls and variables/methods not defined explicitly within the Java method.

Figure 1: How much more/less do language models focus on each semantic category compared to humans? (Relative difference, in percent, between machine and human focus for 18 semantic categories: Literal, Assignment, Operator, Comment, Operation, Exception Handling, Method Call, External Variable/Method, Return, Variable, External Class, Argument, Conditional Block, Loop, Conditional Statement, Variable Declaration, Method Declaration, and Parameter.)

Discussion Point 1: We find no evidence that feature attribution in language models is correlated with programmers' visual focus. Several possible interpretations can be inferred: (1) Alternative methods may be needed to assess feature influence in black-box language models for code summarization, aiming for better alignment with human attention. (2) Access to the internal workings of proprietary models might become critical if white-box models offer more human-aligned insights into explainable language models for code [20]. (3) It is possible that language models and humans reason about code differently when summarizing source code.

4.2 RQ2: Summary Qualities

Table 2: Correlation between human-machine focus alignment and summary quality (assessed by four metrics).

Figure 2: Average ratings of model-generated summaries. (Accuracy, completeness, conciseness, and readability ratings on a scale from 1–4 for GPT4, GPT-few-shot, GPT3.5, StarCoder, Code Llama, and NCS.)

There is also a lack of correlation between the quality of summaries generated by language models and how well their focus on code aligns with humans'. The large p-values in Table 2 suggest that, regardless of which metric is used to assess summary quality, there is a lack of statistically significant correlation between the quality of a model-generated summary on a Java method and how well the model's focus aligns with that of humans on that Java method. Furthermore, Figure 2 shows that NCS produces worse summaries than the other five models. Although Table 1 seems to suggest that NCS's focus is more negatively aligned with human attention, we find no statistically significant metrics supporting such a claim, partially due to the small sample size of Java methods yielding statistically significant Spearman's ρ.

In general, Table 1 suggests that feature attribution in NCS is still moderately positively aligned with that in the other language models on a majority of Java methods. This indicates the likelihood that aspects other than feature attribution are more indicative of and critical to a language model's performance in code summarization.

Discussion Point 2: With a substantial body of work in NLP showing that aligning neural models with human visual patterns can lead to performance improvement [26, 34, 10, 5], we confine our conclusion to the SHAP measure of feature attribution and the human attention as measured in an eye-tracking experiment. The link between human attention and feature attribution in machine models is a subject of intense scientific investigation. We contribute to the debate with the finding that SHAP did not correlate with human eye attention in the measures or models we studied.

5 CONCLUSION

In this paper, we use a state-of-the-art, black-box, perturbation-based method to assess feature attribution in language models on code summarization tasks. We then compare the model-determined important AST tokens with those identified by human visual focus, as measured through eye-tracking. The results suggest that using SHAP to measure feature attribution does not provide explainability of language models through establishing correlations between machine and human foci. Generally, our work can be interpreted in two ways. First, feature attribution measured by SHAP may not be the best way to interpret a language model's focus during code summarization as it fails to establish similarities with human focus. Alternatively, it may be the case that machines reason about code differently from humans when tasked to summarize source code.
REFERENCES
[1] Tobii pro fusion user manual, Jun 2023.
[2] Abid, N. J., Maletic, J. I., and Sharif, B. Using developer eye movements to externalize the mental model used in code summarization tasks. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications (2019), pp. 1–9.
[3] Ahmad, W. U., Chakraborty, S., Ray, B., and Chang, K.-W. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653 (2020).
[4] Bansal, A., Su, C.-Y., Karas, Z., Zhang, Y., Huang, Y., Li, T. J.-J., and McMillan, C. Modeling programmer attention as scanpath prediction. arXiv preprint arXiv:2308.13920 (2023).
[5] Barrett, M., Bingel, J., Hollenstein, N., Rei, M., and Søgaard, A. Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning (2018), pp. 302–312.
[6] Collard, M. L., Decker, M. J., and Maletic, J. I. srcML: An infrastructure for the exploration, analysis, and manipulation of source code: A tool demonstration. In 2013 IEEE International Conference on Software Maintenance (2013), IEEE, pp. 516–519.
[7] Galassi, A., Lippi, M., and Torroni, P. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32, 10 (2020), 4291–4308.
[8] Hooker, S., Erhan, D., Kindermans, P.-J., and Kim, B. Evaluating feature importance estimates.
[9] Huber, D., Paltenghi, M., and Pradel, M. Where to look when repairing code? Comparing the attention of neural models and developers. arXiv preprint arXiv:2305.07287 (2023).
[10] Klerke, S., Goldberg, Y., and Søgaard, A. Improving sentence compression by learning to predict gaze. arXiv preprint arXiv:1604.03357 (2016).
[11] Kou, B., Chen, S., Wang, Z., Ma, L., and Zhang, T. Is model attention aligned with human attention? An empirical study on large language models for code generation. arXiv preprint arXiv:2306.01220 (2023).
[12] LeClair, A., and McMillan, C. Recommendations for datasets for source code summarization. arXiv preprint arXiv:1904.02660 (2019).
[13] Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[14] Liu, Y., Tantithamthavorn, C., Liu, Y., and Li, L. On the reliability and explainability of automated code generation approaches. arXiv preprint arXiv:2302.09587 (2023).
[15] Lundberg, S. M., and Lee, S.-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30 (2017).
[16] Molnar, C. Interpretable machine learning. Lulu.com, 2020.
[17] Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
[18] OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt/, 2023. Accessed: 11/20/2023.
[19] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[20] Paltenghi, M., Pandita, R., Henley, A. Z., and Ziegler, A. Extracting meaningful attention on source code: An empirical study of developer and neural model code exploration. arXiv preprint arXiv:2210.05506 (2022).
[21] Paltenghi, M., and Pradel, M. Thinking like a developer? Comparing the attention of humans with neural models of code. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2021), IEEE, pp. 867–879.
[22] Rodeghero, P., and McMillan, C. An empirical study on the patterns of eye movement during summarization tasks. In 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (2015), IEEE, pp. 1–10.
[23] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
[24] Sharafi, Z., Shaffer, T., Sharif, B., and Guéhéneuc, Y.-G. Eye-tracking metrics in software engineering. In 2015 Asia-Pacific Software Engineering Conference (APSEC) (2015), IEEE, pp. 96–103.
[25] Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning (2017), PMLR, pp. 3145–3153.
[26] Sood, E., Tannert, S., Müller, P., and Bulling, A. Improving natural language processing tasks with human gaze-guided neural attention. Advances in Neural Information Processing Systems 33 (2020), 6327–6341.
[27] Spearman, C. The proof and measurement of association between two things.
[28] Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning (2017), PMLR, pp. 3319–3328.
[29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[30] Wang, Y., Wang, K., and Wang, L. WheaCha: A method for explaining the predictions of code summarization models. arXiv preprint arXiv:2102.04625 (2021).
[31] Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (2022), pp. 1–10.
[32] Zeng, Z., Tan, H., Zhang, H., Li, J., Zhang, Y., and Zhang, L. An extensive study on pre-trained models for program understanding and generation. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (2022), pp. 39–51.
[33] Zhang, K., Li, G., and Jin, Z. What does Transformer learn about source code? arXiv preprint arXiv:2207.08466 (2022).
[34] Zhang, Y., and Zhang, C. Using human attention to extract keyphrase from microblog post. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019), pp. 5867–5872.
[35] Zhang, Z., Zhang, H., Shen, B., and Gu, X. Diet code is healthy: Simplifying programs for pre-trained models of code. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2022), pp. 1073–1084.