Rename Chains: An Exploratory Study on the Occurrence and Characteristics of Identifiers Undergoing Multiple Renamings

Anthony Peruma & Christian Newman

Overview
We explore the phenomenon of a single
identifier undergoing multiple renames (i.e., a
rename chain) through a large-scale empirical
study of 800 open-source Java systems

Introduction
• Research shows that developers spend 58% of their time on program
comprehension activities
• Identifier names account for 70% of the characters in the code base
• Identifier names help developers understand the purpose of the
identifier – essential that names should be high-quality
• Names must be unambiguous and intent-revealing in communicating the purpose
and behavior of the code
• Developers correct poor-quality names via rename refactoring
operations – over 40% of refactoring operations are renames
• Not all renames result in a high-quality name
• An identifier can undergo multiple renamings throughout its lifetime (i.e., a chain
of renamings – a rename chain)

Rename Chain Examples
• A method rename chain resulting in a more descriptive method name in the final rename
• A rename chain resulting in a weak method name; it is just a copy of the statement within the method

Related Work on Identifier Renaming
• Empirical Studies
• Arnaoudova et al. – Rename taxonomy to classify the semantic updates to a name;
developer study showing renaming is not always straightforward
• Peruma et al. – Multiple studies that examine the structure and meaning of names →
developers frequently narrow the meaning of the name; grammar patterns;
contextualization; taxonomy for digits in a name
• Recommendation Models
• Allamanis et al. – utilizes statistical NLP to learn the coding style of a codebase
• Suzuki et al. – An n-gram-based approach for assessing the comprehensibility of method
names and recommending intelligible method names
• Liu et al. – Deep learning techniques to provide recommendations based on the overlap
between method bodies and names that are close in a vector space
• …
While there are studies that investigate (or involve) rename refactorings, this is the first
study that examines a chain of rename operations for an identifier

Goal & Impact
• Understand the evolution of identifier names by constructing and studying the
characteristics of a chain of renames for identifiers (i.e., a rename chain)
• Facilitate the research and development of tools to aid in name appraisal and
recommendations

Research Questions
• RQ1: To what extent do identifiers undergo multiple rename
refactoring operations?
• Understand the volume and types of identifiers that undergo multiple renames
• RQ2: How frequently do renames occur within a rename chain, and
who is responsible for their creation?
• Gain insight into the developers performing the renames in the chain
• RQ3: How do the semantics of an identifier's name evolve in a
rename chain?
• Determine the lexical-semantic properties of names in the rename chain
• RQ4: To what extent can commit log messages help contextualize the
occurrence of rename chains?
• Identify the specific causes for developers to create rename chains

Contributions
• A publicly available dataset of rename chains for replication or extension studies
• An understanding of identifier name evolution and a discussion on avenues for
future research

Experiment Design
Source dataset of rename refactorings and commit details → Dataset of rename chains and their characteristics
• Source Dataset: Used in prior refactoring research studies; renames mined using RefactoringMiner
• Rename Chain Construction: For each project: sort renames using the author commit date; compare the
identifier’s old and new name by their fully qualified name
• Part-of-Speech Tagging: Utilize a specialized identifier name part-of-speech tagger; tags for only the original and
last name in the rename chain
• Topic Modeling: Commit messages associated with rename chains; preprocessing; Latent Dirichlet Allocation
• RQ Analysis: Supplement our quantitative findings with qualitative examples from our dataset
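
The chain construction step can be illustrated with a minimal sketch. It assumes each mined rename is available as a record with a project, old and new fully qualified names, an author, and an author commit date. The `Rename` record, its field names, and the `build_rename_chains` function are hypothetical and are not taken from the study's tooling.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record layout; the field names are assumptions for this sketch.
Rename = namedtuple("Rename", "project old_name new_name author date")


def build_rename_chains(renames):
    """Group renames into chains by linking an old name to an earlier new name."""
    chains = []        # every chain started so far, in order of first rename
    open_chains = {}   # (project, current fully qualified name) -> chain
    # Sort within each project by author commit date.
    for r in sorted(renames, key=lambda r: (r.project, r.date)):
        # If the old name is the latest name of a known chain, extend it.
        chain = open_chains.pop((r.project, r.old_name), None)
        if chain is None:
            chain = []
            chains.append(chain)
        chain.append(r)
        open_chains[(r.project, r.new_name)] = chain
    # A rename chain needs at least two renames of the same identifier.
    return [c for c in chains if len(c) >= 2]


# Made-up example data, not drawn from the dataset.
renames = [
    Rename("demo", "pkg.Foo.run", "pkg.Foo.execute", "alice", datetime(2020, 1, 1)),
    Rename("demo", "pkg.Foo.execute", "pkg.Foo.executeTask", "alice", datetime(2020, 1, 3)),
    Rename("demo", "pkg.Bar.x", "pkg.Bar.count", "bob", datetime(2020, 2, 1)),
]
chains = build_rename_chains(renames)
for chain in chains:
    print(" -> ".join([chain[0].old_name] + [r.new_name for r in chain]))
```

In this sketch the single rename of `pkg.Bar.x` is dropped because it never forms a chain; name collisions between unrelated identifiers are not handled.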

RQ 1: To what extent do identifiers undergo multiple
rename refactoring operations?
• Identifier renaming is a common operation developers perform
• 285,786 operations: Methods: 26.50%; Parameters: 25.53%; Variables: 21.75%
• Most identifiers undergo a single rename in their lifetime
• 17,404 detected rename chains – methods are the most likely to have a chain
• Methods: 30.73%; Variables: 23.47%; Classes: 16.85%
• A rename chain is usually short – composed of a median of 2 rename
instances; variables typically undergo around 3 renamings
• Rename chains occur in most projects
• 83.71% of projects have rename chains
• Projects have a median of 9 identifiers undergoing multiple renames

RQ 2: How frequently do renames occur within a
rename chain, and who is responsible for their
creation?
Interval Analysis – duration between renames in the chain
• Median duration between renames is 2 days
• Attributes: 25 days; Classes: 19 days; Methods: 14 days; Variables: 7 days; Parameters:
2 days
• Duration between the first and last rename:
• Parameters have the lowest interval: 17 days
• Variables have the highest interval: 357 days
Developer Analysis – developers performing the renames
• Most chains have the same developer performing all renames: 62.05%
• Multi-developer chains have around 2 developers involved
• Attribute chains have the highest number of developers: 4
• 11.51% of chains have different developers for the first and last rename
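
The interval and developer measures above can be computed per chain roughly as follows. This is a sketch that assumes a chain is given as a date-ordered list of (author, author commit date) pairs; the `chain_stats` function and the example data are hypothetical.

```python
from datetime import datetime
from statistics import median


def chain_stats(chain):
    """Summarize one chain given as a date-ordered list of (author, date) pairs."""
    dates = [date for _, date in chain]
    # Days between each pair of consecutive renames in the chain.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    authors = [author for author, _ in chain]
    return {
        "median_gap_days": median(gaps),
        "first_to_last_days": (dates[-1] - dates[0]).days,
        "developers": len(set(authors)),
        "same_first_and_last_developer": authors[0] == authors[-1],
    }


# Made-up chain of three renames by two developers.
chain = [
    ("alice", datetime(2020, 1, 1)),
    ("alice", datetime(2020, 1, 3)),
    ("bob", datetime(2020, 1, 20)),
]
stats = chain_stats(chain)
print(stats)
```

Aggregating these per-chain values by identifier type would give figures of the kind reported on this slide.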

RQ 3: How do the semantics of an identifier's name
evolve in a rename chain?
Analysis of the lexical-semantic structure (i.e., part-of-speech tags) of
names in the rename chain
Analysis is limited to only the first and last names in the chain
• The same part-of-speech tags are used for the original and new name
• TestServlet → TheTestServlet → TestServlet : NM-N → DT-NM-N → NM-N
• Developers utilize standard naming structures:
• Class: NM-NM-N → NM-NM-N
• Attribute: NM-N → NM-N
• Method: V-NM-N → V-NM-N
• Parameter & Variable: N → N
• Usually, the original and new names are not identical – 78%
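
As an illustration of the first-versus-last comparison, the sketch below counts grammar-pattern transitions across chains. The tags are supplied by hand because the specialized tagger is not reproduced here; the first chain is the TestServlet example above, while the second chain, its tags, and the helper names `split_identifier` and `pattern_transitions` are assumptions.

```python
import re
from collections import Counter


def split_identifier(name):
    """Split a camelCase or PascalCase identifier into its words."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


def pattern_transitions(tagged_chains):
    """Count (first pattern, last pattern) pairs; intermediate names are ignored."""
    counts = Counter()
    for chain in tagged_chains:
        first_tags, last_tags = chain[0][1], chain[-1][1]
        counts[("-".join(first_tags), "-".join(last_tags))] += 1
    return counts


# Each chain is a list of (name, hand-assigned part-of-speech tags).
tagged_chains = [
    [("TestServlet", ["NM", "N"]),
     ("TheTestServlet", ["DT", "NM", "N"]),
     ("TestServlet", ["NM", "N"])],
    # Hypothetical chain with hand-assigned tags.
    [("getName", ["V", "N"]),
     ("getUserName", ["V", "NM", "N"])],
]
transitions = pattern_transitions(tagged_chains)
print(split_identifier("TheTestServlet"))
print(transitions.most_common())
```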

RQ 4: To what extent can commit log messages help
contextualize the occurrence of rename chains?
An automated analysis of rename chain commit log messages
using Latent Dirichlet Allocation
3 high-level topics associated with these messages:
• Code Cleanup – developer improving code style quality by
adhering to standards – ‘naming’, ‘convention’, ‘whitespace’
• “Lots of fixes using Checkstyle - Fixed some names to follow conventions...”
• Refactoring – developers updating the code related to the
behavior and design of the system – ‘refactor’, ‘updated’, ‘revert’
• “Major refactor to start process of eventually moving content manager classes…”
• Bug Fix/Testing – renames are part of either a bug fix or unit
testing – ‘fix’, ‘bug’, ‘test’, ‘testcase’
• “fixed bug with searching for transitive dependencies + added test for it…”
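
The preprocessing that typically precedes topic modeling can be sketched with the standard library alone. This covers only tokenization and stop-word removal; the resulting token lists would then be passed to a Latent Dirichlet Allocation implementation, which is not shown. The stop-word list and the `preprocess` function are assumptions, and the messages are the excerpts quoted above.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real study would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "it", "of", "the", "to", "with", "some", "using"}


def preprocess(message):
    """Lowercase a commit message, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", message.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]


messages = [
    "Lots of fixes using Checkstyle - Fixed some names to follow conventions",
    "fixed bug with searching for transitive dependencies + added test for it",
]
docs = [preprocess(m) for m in messages]
# Corpus-wide term frequencies, a quick sanity check before fitting a topic model.
term_counts = Counter(token for doc in docs for token in doc)
print(term_counts.most_common(3))
```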

Overall Findings
Renaming is a common activity in software implementation
• Most identifiers typically undergo a single rename
• However, rename chains frequently occur in projects – methods are frequently associated
with rename chains
• Renames in a chain occur days apart – parameters typically having the shortest duration
(approx. two days) and attributes the longest
• Renames in a rename chain are usually performed by the same developer
• Multi-developer chains usually involve two developers
• The grammatical structure of the initial and last name in the chain remains the same
• Code Cleanup, Refactoring, and Bug Fix/Testing are the causes of rename chains –
however, these topics are high level due to the nature of commit messages

Key Takeaways
• Part-of-speech tags are an efficient means of studying the semantic
updates a name undergoes when renamed
• Academia and practitioners should not limit their focus to only the words in a name
• Improvements to name recommendations and appraisal techniques
• These techniques should consider the historical evolution of an identifier’s name in
their evaluation process
• Emphasis on the importance of using high-quality names
• Academia should instill in students the importance of high-quality names in the source
code; practitioners should incorporate naming quality into code reviews
• Challenges with automated contextualization of rename chains
• Current NLP techniques are not adequate for analyzing software engineering artifacts

Conclusion & Future Work
• Interpreting identifier names forms the backbone of any code
comprehension task
• Developers perform renames to correct poor-quality names
• This can continue throughout the system’s lifetime
• We analyze multiple renames applied to a single identifier (i.e., rename
chain)
• Most projects (83.71%) exhibit this phenomenon, with a median chain size of two
renames
• We report on characteristics such as the interval between renames, developers
responsible for chain construction, grammatical changes, and motivation
• Future work: Human subject study
• Validate our empirical findings with developers of varying experience and skills

Thank You!
Anthony Peruma
https://www.peruma.me
