ABSTRACT

Recent language models have demonstrated proficiency in summarizing source code. However, as in many other domains of machine learning, language models of code lack sufficient explainability — informally, we lack a formulaic or intuitive understanding of what and how models learn from code. Explainability of language models can be partially provided if, as the models learn to produce higher-quality code summaries, they also align in deeming the same code parts important as those identified by human programmers. In this paper, we report negative results from our investigation of explainability of language models in code summarization through the lens of human comprehension. We measure human focus on code using eye-tracking metrics such as fixation counts and duration in code summarization tasks. To approximate language model focus, we employ a state-of-the-art model-agnostic, black-box, perturbation-based approach, SHAP (SHapley Additive exPlanations), to identify which code tokens influence the generation of summaries. Using these settings, we find no statistically significant relationship between language models' focus and human programmers' attention. Furthermore, alignment between model and human foci in this setting does not seem to dictate the quality of the LLM-generated summaries. Our study highlights an inability to align human focus with SHAP-based model focus measures. This result calls for future investigation of multiple open questions for explainable language models for code summarization and software engineering tasks in general, including the training mechanisms of language models for code, whether there is an alignment between human and model attention on code, whether human attention can improve the development of language models, and what other model focus measures are appropriate for improving explainability.

CCS CONCEPTS

• Computing methodologies → Artificial intelligence.

KEYWORDS

Neural Code Summarization, Language Models, Explainable AI, SHAP, Human Attention, Eye-Tracking

ACM Reference Format:
Jiliang Li, Yifan Zhang, Zachary Karas, Collin McMillan, Kevin Leach, and Yu Huang. 2024. Do Machines and Humans Focus on Similar Code? Exploring Explainability of Large Language Models in Code Summarization. In 32nd IEEE/ACM International Conference on Program Comprehension (ICPC '24), April 15–16, 2024, Lisbon, Portugal. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3643916.3644434

1 INTRODUCTION

Recent language models for code have shown promising performance on several code-related tasks [31]. Among these tasks is neural code summarization, where a language model generates a short natural language summary describing a given code snippet. This task is often used as an indicator of a model's ability to comprehend code. Currently, the majority of assessments of how well a language model understands code directly measure the quality of code summaries generated by the models and compare them with human-written summaries [31]. Comparatively little is known about why and how the language models reason about code to generate such summaries. Similar to many other downstream domains of machine learning in software engineering, understanding and explaining how and why language models for code work (or fail) is critical to improving model architecture, reducing bias, and preventing undesirable model behavior.

Human programmers typically achieve a strong understanding of code. Thus, proficient language models might be explained if they focus on the same parts of code that humans would [21]. Eye-tracking studies have been conducted to analyze programmers' visual patterns while reading code [2, 22]. Specifically, the duration and frequency of a programmer's eye gaze on a part of code in a spatially-stable manner, referred to as fixation duration and fixation count respectively, are indicative of cognitive load [24]. Thus, these measures of eye-tracking can indicate the parts of code on which human programmers focus. In contrast, there is a lack of consensus on how to measure a language model's reasoning about code (see Section 2.2). Most existing works extract the self-attention layers in language models for code to measure the model attention [21, 20, 9].
Such methods require direct access to the internal layers of a language model, limiting the possibility of investigating the interpretability of many state-of-the-art proprietary models (e.g., ChatGPT).

In this paper, to investigate how proprietary language models reason about code, we employ a state-of-the-art perturbation-based method, SHAP [15] (SHapley Additive exPlanations), that treats each language model as a black-box function. With SHAP, we analyze the feature attribution (i.e., which parts of code are deemed important by the model) in six different state-of-the-art language models for code. We use a set of Java methods to task both the language models and human programmers with writing code summaries. The feature attribution in the language models, measured by SHAP, is then compared with human developers' focus, collected from eye-tracking. We hypothesize that sufficiently large models may learn to focus on parts of code similarly to humans. If validated, language model behavior can thus be described in terms of human behavior, ultimately helping to explain and improve language models. However, we find that explainability cannot be provided through this lens and find no statistically significant evidence suggesting the hypothesized alignment. Furthermore, we did not find that language models' focus exhibits a statistically significant correlation with human focus in general. For future research that aims to explore the explainability of language models for code summarization, especially for those leveraging human attention, our findings might suggest the following: (1) though widely used in AI, SHAP may not be an optimal method to investigate where language models focus during code summarization, or alternatively, (2) a misalignment between language models and human developers in reasoning about code may provide insights for improving AI models for code summarization.

2 BACKGROUND AND RELATED WORK

2.1 Neural Models for Code Summarization

Advancements in deep learning have enabled machine learning models to generate summaries for source code. Among the state-of-the-art models, NeuralCodeSum (NCS) first introduced the use of Transformers in neural code summarization [3]. With the rise of large language models (LLMs), ServiceNow and HuggingFace released a 15.5B parameter LLM for code, StarCoder [13], and Meta released a 7B parameter LLM, Code Llama [23], both of which can serve to summarize code. Although not inherently LLMs for code, GPT3.5 [18] and GPT4 [19] are also capable of code summarization. In this paper, we investigate how all the aforementioned models reason about code when tasked to generate code summaries.

2.2 Interpretability of Language Models

Existing works on interpretable language models generally seek to investigate the relative importance of each input token for model performance [29, 8, 16]. Such works can be commonly categorized into two types: white-box vs. black-box. White-box approaches require access to a language model's internal layers [25, 28], often directly investigating the self-attention scores in Transformer-based models [7, 33, 32]. However, Transformer-based models' inherent complexity has led to a lack of consensus on how to aggregate attention weights [30, 35, 33]. For the general research community, white-box approaches preclude proprietary models (e.g., ChatGPT). In contrast, state-of-the-art black-box approaches like SHAP [15] (SHapley Additive exPlanations) apply game-theoretic principles to assess the impact of input variations on a model's output. SHAP evaluates the effects of different combinations of input features — such as tokens in a text sequence — by observing how their presence or absence (simulated by token masking) alters the model's prediction from an expected result. This process helps to ascertain the relative contribution of each feature to the output, allowing for an analysis of the model without requiring access to its internal architecture [14, 11]. In this paper, to investigate proprietary models, we employ SHAP to measure where language models focus on code.
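As an illustration of this black-box usage, the sketch below obtains a per-(input token, output token) attribution matrix from the shap package for a publicly available sequence-to-sequence code summarizer. The checkpoint name is purely illustrative (it is not one of the models studied in this paper), and the exact explainer interface may vary across shap versions.

    # Minimal sketch: black-box SHAP attribution for a code summarizer.
    # The HuggingFace checkpoint below is an illustrative stand-in, not a model
    # evaluated in this paper.
    import shap
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    checkpoint = "Salesforce/codet5-base-multi-sum"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    java_method = "public int add(int a, int b) { return a + b; }"

    # shap masks subsets of input tokens and observes how the likelihood of each
    # generated output token changes, yielding a game-theoretic attribution.
    explainer = shap.Explainer(model, tokenizer)
    explanation = explainer([java_method])

    # explanation.values[0] is roughly an (input tokens x output tokens) matrix of
    # importance scores; explanation.data[0] holds the corresponding input tokens.
    print(explanation.data[0])
    print(explanation.values[0].shape)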
2.3 Comparing Human vs. Machine Attention

Previous papers have examined the alignment between human and model attention in code comprehension tasks. Paltenghi et al. [20] found that CodeGen's [17] self-attention layers attend to similar parts of code compared to human programmers' visual fixations when answering comprehension questions about code. Similarly, Huber et al. [9] discovered overlaps in attention patterns between neural models and humans when repairing buggy programs. Notably, Paltenghi and Pradel [21] compared language models' self-attention weights and humans' visual attention during code summarization. They found that model attention, measured by self-attention weights, does not align well with human attention. However, this work is limited by investigating only small CNN and transformer models. Most importantly, all aforementioned studies used white-box approaches towards interpretability of open-source models, limiting applicability to state-of-the-art proprietary models. Recently, Kou et al. [11] utilized both white-box and black-box perturbation-based approaches to measure LLMs' focus in code generation tasks, and discovered a consistent misalignment with humans' attention. In general, these works have demonstrated that whether human and machine attention align depends heavily on the methods employed to approximate machine focus, as well as the specific code comprehension task examined. In this paper, we build upon former works by examining whether human attention correlates with feature attribution in language models, measured by a black-box perturbation-based approach, in code summarization.

3 EXPERIMENTAL DESIGN

3.1 Measuring Human Visual Focus

We used eye-tracking data measuring human attention from a controlled human study with 27 programmers. The study obtained IRB approval, and asked participants to read Java methods and write accompanying summaries [4]. Each participant summarized 24–25 Java methods from the FunCom dataset [12], yielding 671 trials of eye-tracking data in total. Considering data quality, two authors with five and eight years of Java experience cooperatively removed participant data associated with five summaries that did not demonstrate an understanding of the Java code.

In this work, we sought to measure where humans and language models focus on code as they summarize it. We first used the srcML parser to convert each Java method into its corresponding Abstract Syntax Tree (AST) representation [6]. The AST provides structural context for each token literal (i.e., ‘Hello World’ → String Literal). With the gaze coordinates collected from the eye-tracker [1], we measured humans' focus on each AST token. Typically, researchers use fixations to quantify human visual focus [24]. A fixation is defined as a spatially stable eye-movement lasting 100–300ms. Most cognitive processing occurs during fixations [24], so researchers consider their frequency and duration in making inferences about human cognition. In our analyses, we computed the average count and duration of programmers' fixations on each AST token. Consequently, for each Java method, we obtained two visual focus vectors, each with length equal to the number of AST tokens, which respectively represent fixation counts and durations on each token. (Our analyses do not include brackets, semi-colons, or other such syntactic elements.)
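To make this aggregation concrete, the sketch below turns per-fixation records into the two visual focus vectors for one Java method. The record layout (participant id, AST token index, fixation duration) is a hypothetical simplification, not the exact format of the eye-tracking data used in the study.

    # Minimal sketch: aggregate fixations into per-AST-token focus vectors.
    # The fixation record layout here is hypothetical.
    from collections import defaultdict

    def human_focus_vectors(fixations, num_ast_tokens, num_participants):
        """fixations: iterable of (participant_id, ast_token_index, duration_ms)."""
        counts = defaultdict(int)
        durations = defaultdict(float)
        for _participant, token_idx, duration_ms in fixations:
            counts[token_idx] += 1
            durations[token_idx] += duration_ms

        # Average over participants so each AST token gets a mean count and duration.
        count_vec = [counts[i] / num_participants for i in range(num_ast_tokens)]
        duration_vec = [durations[i] / num_participants for i in range(num_ast_tokens)]
        return count_vec, duration_vec

Normalization of these vectors so that each sums to 1 is described in Section 3.3.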
3.2 Measuring Model Focus

As mentioned in Section 2.2, we choose SHAP's official, default implementation of the TeacherForcing method to measure feature attribution in language models, treating each as a black-box function. For each language model, we pass in each of the 68 Java methods (also read by human programmers) as input, along with necessary prompting for the model to output summaries of source code. For each Java method passed into each language model, we let i denote an input token (in code) and o denote an output token (in summary). For each (i, o) pair, SHAP produces an importance score, denoted v_(i,o), signifying how much i's presence or absence alters the presence of o. Then, the importance score of each input token, v_i, is calculated such that v_i = Σ_o |v_(i,o)|. (We use the absolute value by choice; without it, experiments show similar results.) Note that v_i is associated with a language model token, and each AST token may consist of several language model tokens. Thus, for each AST token, we calculate its score as (1/n) Σ_{j=1}^{n} v_j, where v_1, ..., v_n are the scores of the language model tokens constituting the AST token. Consequently, for each language model on each Java method, we obtain a focus vector (with a length equal to the number of AST tokens) representing how influential each AST token is to the model.

In total, we investigated the model focus of six different models: GPT4, GPT-few-shot, GPT3.5, StarCoder, Code Llama, and NCS. Here, GPT-few-shot is a GPT3.5 model, but in an attempt to make the model produce code summaries more similar to those written by humans, we used few-shot prompting to instruct the model to provide summaries similar to two randomly selected human-written summaries. The other five state-of-the-art LLMs are introduced in Section 2.1 and implemented with their default parameters.
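For concreteness, the following numpy sketch performs the SHAP aggregation described at the beginning of this section, starting from a per-(input token, output token) SHAP matrix; the mapping from language model (subword) tokens to AST tokens is assumed to be available.

    # Minimal sketch: collapse SHAP values into an AST-token focus vector,
    # following v_i = sum_o |v_(i,o)| and per-AST-token averaging of subword scores.
    import numpy as np

    def ast_focus_vector(shap_matrix, ast_groups):
        """
        shap_matrix: array of shape (n_model_tokens, n_output_tokens) holding the
                     SHAP value v_(i,o) for input token i and output token o.
        ast_groups:  list of lists; ast_groups[k] contains the indices of the
                     language model tokens that constitute AST token k.
        """
        # Importance of each language model token: sum of |v_(i,o)| over outputs.
        token_scores = np.abs(shap_matrix).sum(axis=1)

        # Each AST token's score is the mean of its constituent model-token scores.
        focus = np.array([token_scores[indices].mean() for indices in ast_groups])

        # Normalize to sum to 1 (Section 3.3) so vectors are comparable across methods.
        total = focus.sum()
        return focus / total if total > 0 else focus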
3.3 Comparing Human and Model Foci

For brevity, we refer to the two human visual focus measurements (i.e., fixation duration and count) and the six language models as eight "focus sources." For each source, we obtained 68 focus vectors, each corresponding to a Java method. These vectors were normalized to sum to 1, and reflect how important each AST token is for the human/model. We answer these research questions:

• RQ1: Is there a general correlation between human and machine focus patterns for code summarization?
• RQ2: Do the code summaries increase in quality when machine focus becomes more aligned with that of humans?

3.3.1 RQ1. We assess the correlation between human and machine foci across the 68 Java methods. Specifically, for each pair of focus sources, we iterate through each Java method and calculate the Spearman's rank coefficient (ρ) [27] between the two sources' vectors for that method. Then, for each pair of focus sources, we report: (1) the mean and standard deviation of Spearman's ρ across all Java methods where the correlation is statistically significant (p ≤ 0.05), and (2) the proportion of Java methods demonstrating a statistically significant correlation (p ≤ 0.05).

In addition, we group all AST tokens into 18 semantic categories (e.g., method call, operator, etc.) and investigate how much humans and language models focus on each semantic category. (We use fixation durations to represent human focus; we empirically verify that using fixation count yields similar results.) The focus score assigned to each semantic category is the sum of the focus scores assigned to each AST token belonging to that semantic category. To counter biases where certain semantic categories contain more AST tokens or appear more frequently, we report the relative difference between machine and human foci for each semantic category. That is, we average the six language models' focus scores per category and report |(foc_machine − foc_human) / foc_human|.
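A minimal sketch of this per-method analysis, assuming the 68 focus vectors of each source are stored in dictionaries keyed by Java method id, is shown below.

    # Minimal sketch: RQ1 statistics for one pair of focus sources.
    import numpy as np
    from scipy.stats import spearmanr

    def compare_focus_sources(source_a, source_b, alpha=0.05):
        """source_a, source_b: dicts mapping a Java method id to a focus vector."""
        significant_rhos = []
        for method_id in source_a:
            rho, p = spearmanr(source_a[method_id], source_b[method_id])
            if p <= alpha:
                significant_rhos.append(rho)

        proportion_significant = len(significant_rhos) / len(source_a)
        mean_rho = float(np.mean(significant_rhos)) if significant_rhos else float("nan")
        std_rho = float(np.std(significant_rhos)) if significant_rhos else float("nan")
        return mean_rho, std_rho, proportion_significant

    def relative_category_difference(foc_machine, foc_human):
        """Per-category relative difference |(machine - human) / human| (cf. Figure 1)."""
        return {c: abs((foc_machine[c] - foc_human[c]) / foc_human[c]) for c in foc_human}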
3.3.2 RQ2. Here, a human expert provides quality ratings for summaries generated by each language model for every Java method using four criteria: accuracy, completeness, conciseness, and readability. Next, we calculate the Spearman's ρ between each language model's focus vector and the human fixation duration vector across all Java methods where the correlation is significant (p ≤ 0.05). We then append all such statistically significant ρ's to form a vector, denoted v_cor, to represent the degrees of alignment between machine and human foci across the Java methods investigated. (Note that v_cor contains Spearman's ρ's obtained from all six language models; we empirically verify that conducting the analogous analysis for each language model separately yields a similar result.)

Subsequently, we determine whether this alignment is correlated with the rated quality of summaries. Specifically, we construct four other vectors, {v_acc, v_com, v_con, v_rea}, containing the accuracy, completeness, conciseness, and readability scores respectively. At each index i, {v_cor[i], v_acc[i], v_com[i], v_con[i], v_rea[i]} are respectively the Spearman's ρ, summary accuracy, completeness, conciseness, and readability of the same language model applied on the same Java method. We then measure and report the Spearman's rank correlation between v_cor and v_i, where v_i ∈ {v_acc, v_com, v_con, v_rea}.
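Under the same (assumed) data layout, the RQ2 analysis reduces to one more rank correlation between the alignment vector v_cor and each quality vector, as sketched below.

    # Minimal sketch: correlate human-machine focus alignment with summary quality.
    from scipy.stats import spearmanr

    def alignment_vs_quality(records, alpha=0.05):
        """
        records: one dict per (language model, Java method) pair with keys 'rho'
                 (focus alignment with human fixation duration), 'p' (its p-value),
                 and the expert ratings 'accuracy', 'completeness', 'conciseness',
                 and 'readability'.
        """
        # Keep only pairs whose focus alignment is statistically significant (v_cor).
        kept = [r for r in records if r["p"] <= alpha]
        v_cor = [r["rho"] for r in kept]

        results = {}
        for metric in ("accuracy", "completeness", "conciseness", "readability"):
            v_metric = [r[metric] for r in kept]
            # Spearman's rank correlation between alignment and this quality metric.
            results[metric] = spearmanr(v_cor, v_metric)
        return results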
4 RESULTS

4.1 RQ1: General Correlation

Table 1: Pair-wise correlation among focus sources; "Duration" and "Count" are human visual focus. Each cell shows the means and standard deviations of Spearman's ρ for all Java methods showing significant correlation (p ≤ 0.05).

                Duration    Count       GPT4        GPT-few     GPT3.5      StarCoder   Code Llama  NCS
    Duration    1.00±0.00   0.88±0.06   -0.11±0.41  -0.13±0.42  -0.09±0.52  -0.18±0.48  -0.18±0.42  -0.24±0.40
    Count       —           1.00±0.00    0.01±0.45  -0.24±0.33  -0.10±0.48  -0.31±0.29  -0.13±0.43  -0.33±0.33
    GPT4        —           —            1.00±0.00   0.68±0.12   0.76±0.12   0.67±0.14   0.67±0.14   0.55±0.13
    GPT-few     —           —            —           1.00±0.00   0.72±0.12   0.62±0.15   0.64±0.15   0.55±0.13
    GPT3.5      —           —            —           —           1.00±0.00   0.65±0.16   0.67±0.15   0.58±0.13
    StarCoder   —           —            —           —           —           1.00±0.00   0.66±0.15   0.59±0.11
    Code Llama  —           —            —           —           —           —           1.00±0.00   0.56±0.14
    NCS         —           —            —           —           —           —           —           1.00±0.00

As shown in Table 1, there is a general lack of correlation between human and machine foci. We highlight that the means and standard deviations in Table 1 are only calculated from Java methods where the Spearman's ρ is statistically significant (with p ≤ 0.05). In practice, between any pair of human-LLM focus sources, at most 22% of the 68 Java methods yield a Spearman's ρ with p ≤ 0.05. As a baseline, the Spearman's ρ has p ≤ 0.05 for all Java methods between the human fixation duration and fixation count vectors, and for 85% of Java methods between any two language models' focus vectors. This implies that any existing correlation between human and machine foci is not widespread across the Java methods studied.

Furthermore, among those Java methods where the correlation is statistically significant, the mean Spearman's ρ is small with a large standard deviation. In fact, for most such methods where a model and human show significant correlation in focus, the Spearman's ρ is often either around 0.5 or −0.5, but rarely in between. This suggests the relationship between human and machine foci varies significantly depending on the specific Java method.

Interestingly, although few-shot alignment in GPT-few-shot renders the model's generated summaries more similar to those of humans, this does not lead to higher correlations between model and human foci. In addition, feature attribution in all language models is moderately or strongly positively correlated with each other on a majority of Java methods, which intuitively makes sense since all six models are based on the Transformer architecture.

We also investigate how language models' focus on each semantic category differs from that of humans. As shown in Figure 1, language models' generation of code summaries seems to be more reliant on comments, return values, and specific statements such as literals and assignments, and less reliant on method calls and variables/methods not defined explicitly within the Java method.

Figure 1: How much more/less do language models focus on each semantic category compared to humans? (Relative difference, in percent, between machine and human focus for 18 semantic categories: Literal, Assignment, Operator, Comment, Operation, Exception Handling, Method Call, External Variable/Method, Return, Variable, External Class, Argument, Conditional Block, Loop, Conditional Statement, Variable Declaration, Method Declaration, and Parameter.)

Discussion Point 1: We find no evidence that feature attribution in language models is correlated with programmers' visual focus. Several possible interpretations can be inferred: (1) Alternative methods may be needed to assess feature influence in black-box language models for code summarization, aiming for better alignment with human attention. (2) Access to the internal workings of proprietary models might become critical if white-box models offer more human-aligned insights into explainable language models for code [20]. (3) It is possible that language models and humans reason about code differently when summarizing source code.

4.2 RQ2: Summary Qualities

Table 2: Correlation between human-machine focus alignment and summary quality (assessed by four metrics).

Figure 2: Average ratings of model-generated summaries. (Accuracy, completeness, conciseness, and readability ratings on a scale from 1–4 for GPT4, GPT-few-shot, GPT3.5, StarCoder, Code Llama, and NCS.)

There is also a lack of correlation between the quality of summaries generated by language models and how well their focus on code aligns with humans'. The large p-values in Table 2 suggest that, regardless of which metric is used to assess summary quality, there is a lack of statistically significant correlation between the quality of a model-generated summary on a Java method and how well the model's focus aligns with that of humans on that Java method. Furthermore, Figure 2 shows that NCS produces worse summaries than the other five models. Although Table 1 seems to suggest that NCS's focus is more negatively aligned with human attention, we find no statistically significant metrics supporting such a claim, partially due to the small sample size of Java methods yielding statistically significant Spearman's ρ.

In general, Table 1 suggests that feature attribution in NCS is still moderately positively aligned with that in the other language models on a majority of Java methods. This indicates the likelihood that aspects other than feature attribution are more indicative of and critical to a language model's performance in code summarization.

Discussion Point 2: With a substantial body of work in NLP showing that aligning neural models with human visual patterns can lead to performance improvement [26, 34, 10, 5], we confine our conclusion to the SHAP measure of feature attribution and the human attention as measured in an eye-tracking experiment. The link between human attention and feature attribution in machine models is a subject of intense scientific investigation. We contribute to the debate with the finding that SHAP did not correlate with human eye attention in the measures or models we studied.

5 CONCLUSION

In this paper, we use a state-of-the-art, black-box, perturbation-based method to assess feature attribution in language models on code summarization tasks. We then compare the model-determined important AST tokens with those identified by human visual focus, as measured through eye-tracking. The results suggest that using SHAP to measure feature attribution does not provide explainability of language models through establishing correlations between machine and human foci. Generally, our work can be interpreted in two ways. First, feature attribution measured by SHAP may not be the best way to interpret a language model's focus during code summarization as it fails to establish similarities with human focus. Alternatively, it may be the case that machines reason about code differently from humans when tasked to summarize source code.
REFERENCES
[1] Tobii pro fusion user manual, Jun 2023.
[2] Abid, N. J., Maletic, J. I., and Sharif, B. Using developer eye movements to externalize the mental model used in code summarization tasks. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications (2019), pp. 1–9.
[3] Ahmad, W. U., Chakraborty, S., Ray, B., and Chang, K.-W. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653 (2020).
[4] Bansal, A., Su, C.-Y., Karas, Z., Zhang, Y., Huang, Y., Li, T. J.-J., and McMillan, C. Modeling programmer attention as scanpath prediction. arXiv preprint arXiv:2308.13920 (2023).
[5] Barrett, M., Bingel, J., Hollenstein, N., Rei, M., and Søgaard, A. Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning (2018), pp. 302–312.
[6] Collard, M. L., Decker, M. J., and Maletic, J. I. srcML: An infrastructure for the exploration, analysis, and manipulation of source code: A tool demonstration. In 2013 IEEE International Conference on Software Maintenance (2013), IEEE, pp. 516–519.
[7] Galassi, A., Lippi, M., and Torroni, P. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32, 10 (2020), 4291–4308.
[8] Hooker, S., Erhan, D., Kindermans, P.-J., and Kim, B. Evaluating feature importance estimates.
[9] Huber, D., Paltenghi, M., and Pradel, M. Where to look when repairing code? Comparing the attention of neural models and developers. arXiv preprint arXiv:2305.07287 (2023).
[10] Klerke, S., Goldberg, Y., and Søgaard, A. Improving sentence compression by learning to predict gaze. arXiv preprint arXiv:1604.03357 (2016).
[11] Kou, B., Chen, S., Wang, Z., Ma, L., and Zhang, T. Is model attention aligned with human attention? An empirical study on large language models for code generation. arXiv preprint arXiv:2306.01220 (2023).
[12] LeClair, A., and McMillan, C. Recommendations for datasets for source code summarization. arXiv preprint arXiv:1904.02660 (2019).
[13] Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[14] Liu, Y., Tantithamthavorn, C., Liu, Y., and Li, L. On the reliability and explainability of automated code generation approaches. arXiv preprint arXiv:2302.09587 (2023).
[15] Lundberg, S. M., and Lee, S.-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30 (2017).
[16] Molnar, C. Interpretable machine learning. Lulu.com, 2020.
[17] Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
[18] OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt/, 2023. Accessed: 11/20/2023.
[19] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[20] Paltenghi, M., Pandita, R., Henley, A. Z., and Ziegler, A. Extracting meaningful attention on source code: An empirical study of developer and neural model code exploration. arXiv preprint arXiv:2210.05506 (2022).
[21] Paltenghi, M., and Pradel, M. Thinking like a developer? Comparing the attention of humans with neural models of code. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2021), IEEE, pp. 867–879.
[22] Rodeghero, P., and McMillan, C. An empirical study on the patterns of eye movement during summarization tasks. In 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (2015), IEEE, pp. 1–10.
[23] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
[24] Sharafi, Z., Shaffer, T., Sharif, B., and Guéhéneuc, Y.-G. Eye-tracking metrics in software engineering. In 2015 Asia-Pacific Software Engineering Conference (APSEC) (2015), IEEE, pp. 96–103.
[25] Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning (2017), PMLR, pp. 3145–3153.
[26] Sood, E., Tannert, S., Müller, P., and Bulling, A. Improving natural language processing tasks with human gaze-guided neural attention. Advances in Neural Information Processing Systems 33 (2020), 6327–6341.
[27] Spearman, C. The proof and measurement of association between two things.
[28] Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning (2017), PMLR, pp. 3319–3328.
[29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[30] Wang, Y., Wang, K., and Wang, L. WheaCha: A method for explaining the predictions of code summarization models. arXiv preprint arXiv:2102.04625 (2021).
[31] Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (2022), pp. 1–10.
[32] Zeng, Z., Tan, H., Zhang, H., Li, J., Zhang, Y., and Zhang, L. An extensive study on pre-trained models for program understanding and generation. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (2022), pp. 39–51.
[33] Zhang, K., Li, G., and Jin, Z. What does Transformer learn about source code? arXiv preprint arXiv:2207.08466 (2022).
[34] Zhang, Y., and Zhang, C. Using human attention to extract keyphrase from microblog post. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019), pp. 5867–5872.
[35] Zhang, Z., Zhang, H., Shen, B., and Gu, X. Diet code is healthy: Simplifying programs for pre-trained models of code. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2022), pp. 1073–1084.