[FSE18] Deep Learning Type Inference
ESEC/FSE ’18, November 4–9, 2018, Lake Buena Vista, FL, USA V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis
Figure 1: Loosely aligned example of JavaScript code and the TypeScript equivalent with corresponding type annotations.

While developers may benefit from typed dialects of JS such as TS, the migration path from JS to TS is challenging as it requires annotating existing codebases, which may be large. This is a time-consuming and potentially error-prone process. Fortunately, the growth of TS’ popularity in the open-source ecosystem gives us the unique opportunity to learn type inference from real-world data: it offers a dataset of JS-like code with type annotations, which can be converted into an aligned training corpus of code and its corresponding types. We use this data to train DeepTyper, which uses deep learning on existing typed code to infer new type annotations for JS and TS. It learns to annotate all identifiers with a type vector: a probability distribution over types, which we use to propose types that a verifier can check and relay to a user as type suggestions.

In this work, we demonstrate the general applicability of deep learning to this task: it enriches conventional type inference with a powerful intuition based on both names and (potentially extensive) context, while also identifying the challenges that need to be addressed in further research, mainly: established models (in our case, deep recurrent neural networks) struggle to carry dependencies, and thus stay consistent, across the length of a function. When trained, DeepTyper can infer types for identifiers that the compiler’s type inference cannot establish, which we demonstrate by replicating real-world type annotations in a large body of code. DeepTyper suggests type annotations with over 80% precision at recall values of 50%, often providing either the correct type or at least narrowing down the potential types to a small set. Our contributions are three-fold:

(1) A learning mechanism for type inference using an aligned corpus and differentiable type vectors that relax the task of discrete type assignment to a continuous, real-valued vector function.
(2) Demonstration of both the potential and challenges of using deep learning for type inference, particularly with a proposed enhancement to existing RNNs that increases consistency and accuracy of type annotations across a file.
(3) A symbiosis between a probabilistic learner and a sound type inference engine that mutually enhances performance. We also demonstrate a mutually beneficial symbiosis with JSNice [28], which tackles a similar problem.

2 PROBLEM STATEMENT

A developer editing a file typically interacts with various kinds of identifiers, such as names of functions, parameters and variables. Each of these lives in a type system, which constrains operations to take only operands on which they are defined. Knowledge of the type at compile-time can improve the code’s performance and allow early detection of faults [14, 17, 27]. A stronger type system is also useful for software development tools, for instance improving auto-completion accuracy and debugging information. Although virtually every programming language has a type system, type information can be difficult to infer at compile-time without type annotations in the code itself. As a result, dynamically typed languages such as JavaScript (JS) and Python are often at a disadvantage. At the same time, using type annotations comes with a type annotation tax [14], paid when adding type annotations, navigating around them while reading code, and wrestling with type errors. Perhaps for these reasons, developers voted with their keyboards at the beginning of the 21st century and increasingly turned to dynamically typed languages, like Python and JS.

2.1 A Middle Ground

Type annotations have not disappeared, however: adding (partial) type annotations to dynamically typed programming languages has become common in modern software engineering. Python 3.x introduced type hints via its typing module, which is now widely used, notably by the mypy static checker [2]. For JS, multiple solutions exist, including Flow [1] and TypeScript [5]. Two of the largest software houses, Facebook and Microsoft, have invested heavily in these two offerings, which is a testament to the value industry is now placing on returning to languages that provide type-backed assurances. These new languages differ from their predecessors: to extend and integrate with dynamic languages, their type systems permit programs to be partially annotated,1 not in the sense that some of the annotations can be missing because they can be inferred, but in the sense that, for some identifiers, the correct annotation is unknown. When an identifier’s principal type is unknown, these type systems annotate that identifier with an implicit any, reflecting a lack of knowledge of the identifier’s type. Their support for partial typing makes them highly deployable in JS shops, but taking full advantage of them still requires paying the annotation tax, and slowly replacing anys with type annotations.

1 We avoid calling these type systems gradual because they violate the clause of the gradual guarantee [30] that requires them to enforce type invariants, beyond those they get “for free” from JS’ dynamic type system, at runtime.

One of these languages is TypeScript (TS): a statically typed superset of JS that transpiles to JS, allowing it to be used as a drop-in replacement for JS. In TS, the type system includes primitive types (e.g. number, string), user-defined types (e.g. Promise, HTMLElement), combinations of these, and any. TS comes with compile-time type inference, which yields some of the benefits of a static type system but is fundamentally limited in what it can soundly infer due to JS features like duck-typing. Consider
the JS code on the left-hand side of Figure 1: the type for p may be inferred from the call to createElement, which returns an HTMLElement.2 On the other hand, the type of cssText is almost certainly string, but this cannot soundly be inferred from its usage here.3 For such identifiers, developers would need to add type annotations, as shown in the TS code on the right.

2 Provided the type of ownerDocument is Document, which may itself require an annotation.
3 Grammatically, it could e.g. be number.

2.2 Type Suggestion

To the developer wishing to transition from the code on the left to that on the right in Figure 1, a tool that can recommend accurate type annotations, especially where traditional type inference fails, would be helpful. This type suggestion task of easing the transition from a partially to a fully typed code-base is the goal of our work. We distinguish two objectives for type suggestion:

(1) Closed-world type suggestion recommends annotations to the developer from some finite vocabulary of types, e.g. to add to declarations of functions or variables.
(2) Open-world type suggestion aims to suggest novel types to construct that reflect computations in the developer’s code.

As a first step to assisting developers in annotating their code, we restrict ourselves to the first task and leave the second to future work. Specifically, our goal is to learn to recommend the (ca. 11,000) most common types from a large corpus of code, including those shown in Figure 1. To achieve this, we view the type inference problem as a translation problem between un-annotated JS/TS and annotated TS. We chose to base our work on TS because, as a superset of JS, it is designed to displace JS in developers’ IDEs. Thus, a growing body of projects has already adopted it (including well-known projects such as Angular and Reactive Extensions) and we can leverage their code to train DeepTyper. We can use TS’ compiler to automatically generate training data consisting of pairs of TS without type annotations and the corresponding types for DeepTyper’s training.

Here, the fact that we are translating between two such closely related languages is a strength of our approach, easing the alignment problem [9, 16] and vastly reducing the search space our models must traverse.4 We train our translator, DeepTyper, on TS inputs (Figure 1, right), then test it on JS (left) or partially annotated TS files. DeepTyper suggests variable and function annotations, consisting of return, local variable, and parameter types.

4 Vasilescu et al.’s work on deobfuscating JS also successfully leverages machine translation between two closely related languages [33].

3 METHOD

To approach type inference with machine learning, we are inspired by existing natural language processing (NLP) tasks, such as part-of-speech (POS) tagging and named entity recognition (NER) [12, 20]. In those tasks, a machine learning model needs to infer the role of a given word from its context. For example, the word “mail” can be either a verb or a noun when viewed in isolation, but when given context in the sentence “I will mail the letter tomorrow”, the part of speech becomes apparent. To solve this ambiguity, NLP research has focused on probabilistic methods that learn from data.

Tasks like these are amenable to sequence-to-sequence models, in which a sequence of tokens is transformed into a sequence of types (in our case) [32]. Specifically, our task is a sequence of annotation tasks, in which all elements $s_t$ in an input sequence $s_1 \ldots s_N$ need to be annotated. Therefore, when approaching this problem with probabilistic machine learning, the modeled probability distribution is $P(\tau_1 \ldots \tau_N \mid s_1 \ldots s_N)$, where $\tau_i$ represents the type annotation of $s_i$. In our case, the annotations are the types for the tokens in the input, where we align tokens that have no type (e.g. punctuation, keywords) with a special no-type symbol.

Although deriving type annotations has many similarities to POS tagging and NER, it also presents some unique characteristics. First, our task has a much larger set of possible type annotations. The widely used Penn Treebank Project uses only 36 distinct part-of-speech tags for all English words, while we aim to predict more than 11,000 types (Section 4.2). Furthermore, NLP tasks annotate a single instance of a word, whereas we are interested in annotating a variable that may be used multiple times, and the annotations ought to be consistent across occurrences.

3.1 A Neural Architecture for Type Inference

Similar to recent models in NLP, we turn to deep neural networks for our type inference task. Recurrent Neural Networks (RNN) [9, 15, 19] have been widely successful at many natural language annotation tasks such as named entity recognition [12] and machine translation [9]. RNNs are neural networks that work on sequences of elements, such as words, making them a natural fit for our task. The general family of RNNs is defined over a sequence of elements $s_1 \ldots s_N$ as $h_t = \mathrm{RNN}(x_{s_t}, h_{t-1})$, where $x_{s_t}$ is a learned representation (embedding) of the input element $s_t$ and $h_{t-1}$ is the previous output state of the RNN. The initial state $h_0$ is usually set to a null vector ($\mathbf{0}$). Both $x$ and $h$ are high-dimensional vectors, whose dimensionality is tunable: higher dimensions allow the model to capture more information, but also increase the cost of training and may lead to overfitting.

As we feed input tokens to the network in order, the vector $x$ for each token is its representation, while $h$ is the output state of the RNN based on both this current input and its previous state. Thus, RNNs can be seen as networks that learn to “summarize” the input sequence $s_1 \ldots s_t$ with $h_t$. There are many different implementations of RNNs; in this work, we use GRUs (Gated Recurrent Units) [9]. For a more extensive discussion of RNNs, we refer the reader to Goodfellow et al. [15].

In general translation tasks (e.g. English to French), the length and ordering of words in the input and output sequence may be different. RNN-based translation models account for these changes by first completely digesting the input sequence, then using their final state (typically plus some attention mechanism [26]) to construct the output sequence, token by token. In our case, however, the token and type sequence are perfectly aligned, allowing us to treat our suggestion task as a sequence annotation task, also used for POS tagging and NER. In this setting, for every input token that we provide to the RNN, we also expect an output type judgement. Since the RNN does not have to digest the full input before making type judgements, using this precise alignment can yield better performance.
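As a rough illustration of the sequence annotation setting described in Section 3.1, the sketch below runs a single-layer, unidirectional GRU over a token sequence and emits one type vector per token. It is not the authors' implementation: the weights are random rather than trained, biases are omitted, and the toy vocabularies, dimensions and all function names (gru_step, annotate, and so on) are assumptions made for this example.

```python
import math
import random

random.seed(0)

TOKENS = ["var", "x", "=", "0", ";"]                  # toy input sequence s_1 .. s_N
TOKEN_VOCAB = {tok: i for i, tok in enumerate(sorted(set(TOKENS)))}
TYPE_VOCAB = ["no-type", "number", "string", "any"]   # toy closed-world type vocabulary
EMBED_DIM, HIDDEN_DIM = 4, 6                          # tiny, purely illustrative sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    top = max(zs)
    exps = [math.exp(z - top) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


# Untrained parameters: embeddings, GRU gates (update z, reset r, candidate n), output layer.
E = rand_matrix(len(TOKEN_VOCAB), EMBED_DIM)
W = {g: rand_matrix(HIDDEN_DIM, EMBED_DIM) for g in "zrn"}
U = {g: rand_matrix(HIDDEN_DIM, HIDDEN_DIM) for g in "zrn"}
W_out = rand_matrix(len(TYPE_VOCAB), HIDDEN_DIM)


def gru_step(x, h_prev):
    """One recurrence h_t = GRU(x_{s_t}, h_{t-1}); gate conventions vary between papers."""
    z = [sigmoid(a + b) for a, b in zip(matvec(W["z"], x), matvec(U["z"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(W["r"], x), matvec(U["r"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    n = [math.tanh(a + b) for a, b in zip(matvec(W["n"], x), matvec(U["n"], gated))]
    return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h_prev, n)]


def annotate(tokens):
    """Return one type vector (a distribution over TYPE_VOCAB) per input token."""
    h = [0.0] * HIDDEN_DIM                    # initial state h_0 is the null vector
    type_vectors = []
    for tok in tokens:
        h = gru_step(E[TOKEN_VOCAB[tok]], h)  # h now summarizes s_1 .. s_t
        type_vectors.append(softmax(matvec(W_out, h)))
    return type_vectors


type_vectors = annotate(TOKENS)
for tok, vec in zip(TOKENS, type_vectors):
    best = max(range(len(vec)), key=vec.__getitem__)
    print(f"{tok!r:6} -> {TYPE_VOCAB[best]} (p={vec[best]:.2f})")
```

Because a type vector is produced at every step, the output is aligned with the input by construction, which is the property that distinguishes this setup from an encoder-decoder translation model. With untrained weights the printed types are of course meaningless.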
Figure 3: Overview of the experimental setup, consisting of three phases: data gathering, learning from aligned types, and evaluation.

three categories of identifiers that allow optional7 type annotations: function return types, function parameters and variables. DeepTyper learns to suggest these by learning to assign a probability distribution over types, denoted a type vector, to each identifier occurrence in a file. To improve training, we learn to assign types not only to definition sites, where the annotation would be added, but to all occurrences of an identifier. This helps the deep learner include more context in its type judgements and allows us to enforce its additional consistency constraint as described in Section 3.

7 Here, any is implicit if no annotation is added.

The model is presented with the code as sequential data, with each token aligned with a type. Each token and type are encoded in their respective vocabularies (see Section 4.2) as a one-hot vector (with a one at the index of the correct token/type and zeros otherwise). The type may be a (deterministically assignable) no-type for tokens such as punctuation and keywords; we do not train the algorithms to assign these. Given a sequence of tokens, the model is tasked to predict the corresponding sequence of types.

At training time, the model’s accuracy is measured in terms of the cross-entropy between its produced type vector and the true, one-hot encoded type vector. At test time, the model is tasked with inferring the correct annotations at the locations where developers originally added type annotations that we removed to produce our aligned data. Although the model infers types for all occurrences of every identifier (because of the way it is trained), we report our results on the true original type annotations both because this is the most realistic test criterion and to avoid confusion.8

8 In brief, across all identifiers, DeepTyper reaches accuracies close to that of the compiler’s type inference and a hybrid of the two was able to yield superior results.

We evaluate the model primarily in terms of prediction accuracy: the likelihood that the most activated element of the type vector is the correct type. We focus on assigning non-any types (recall that any expresses uncertainty about a type), since those will be most useful to a developer. We furthermore distinguish between evaluating the accuracy at all identifier locations (including non-definition sites, as we do at training time) and inferring only at those positions where developers actually added type annotations in our dataset. For more details, see Section 4.4.

4.2 Data

Data Collection. We collected the 1,000 top starred open-source projects on GitHub that predominantly consisted of TypeScript code on February 28, 2018; this is a similar approach to Ray et al.’s study of programming languages [27]. Each project was parsed with the TypeScript compiler tsc, which infers type information (possibly any) for all occurrences of each identifier. We removed all files containing more than 5,000 tokens so that the sequences fit within a minibatch used in our deep learner. This removed only a small portion of both files (ca. 0.9%) and tokens (ca. 11%). We also removed all projects containing only TypeScript header files, which notably includes all projects from the ‘DefinitelyTyped’ eco-system. After these steps, our dataset contains 776 TypeScript projects, with statistics listed in Table 1.

Our dataset was randomly split by project into 80% training data, 10% held-out (or validation) data and 10% test data. Among the largest projects included were Karma-Typescript (a test framework for TS), Angular and projects related to Microsoft’s VS Code. We focus only on inter-project type suggestion, because we believe this to be the most realistic use of our tool. That is, the model is trained on a pre-existing set of projects and then used to provide suggestions in a different/new project that was not seen during training. Future work may study an intra-project setting, in which the model can benefit from project-specific information, which will likely improve type suggestion accuracy.

Token and Type Vocabularies. As is common practice in natural language processing, we estimate our vocabularies on the training split and replace all the rare tokens (in our case, those seen fewer than 10 times) and all unseen tokens in the held-out and test data with a generic UNKNOWN token. Note that we still infer types for these tokens, even though their name provides no useful information to the deep learner. To reduce vocabulary size, we also replaced all numerals with ‘0’, all strings with “s” and all templates with a simple `template`, none of which affects the types of the code. The type vocabulary is similarly estimated on the types of the training data, except that rare types (again, those seen fewer than 10 times in the training data) and unseen types are simply treated as any. The number of tokens and types strongly correlates with the complexity of the model, so we set the vocabulary cut-off as low as was possible while still making training feasible in reasonable time and memory. The resulting vocabularies consist of 40,195 source tokens and 11,830 types.

Aligning Data. To create an oracle and aligned corpus, we use the compiler to add type annotations to every identifier. We then remove all type annotations from the TS code, in order to create code that more closely resembles JS code. Note that this does not always produce actual JS code since TS includes a richer syntax
beyond just type annotations. We create two types of oracle datasets from this translation:

(1) ALL identifier data (training): we create an aligned corpus between tokens and types, in which every occurrence of every identifier has a type annotation from the compiler. This is the type of oracle data that we use for training. This data likely includes more types than a developer would want to annotate, since many could be inferred by the compiler.
(2) GOLD, annotation-only data (testing): we align only the types that developers annotated with the declaration site where the annotation was added. All other tokens are aligned with a no-type. This provides the closest approximation of the annotations that developers care about and serves as our test data.

4.3 Experiments and Models

DeepTyper. We study the accuracy and behavior of deep learning networks when applied to type inference across a range of metrics (see Section 4.4). Our proposed model enhances a conventional RNN structure with a consistency layer as described in Section 3 and is denoted DeepTyper. We compare this model against a plain RNN with the same architecture minus the consistency layer.

For our RNNs, we use 300-dimensional token embeddings, which are trained jointly with the model, and two 650-dimensional hidden layers, implemented as a bi-directional network with two GRUs each (Section 3). This allows information to flow forward and backward in the model and improves its accuracy. Finally, we apply drop-out regularization [31] with a drop-out probability of 50% to the second hidden layer and layer-normalization after the embedding layer. As is typical in NLP tasks like this, the token sequence is padded with start- and end-of-sentence tags (with no-type) as cues for the model.

We train the deep learner for 10 epochs with a minibatch size of up to five thousand tokens, requiring ca. 4,100 minibatches per epoch. We use a learning configuration that is typical for these tasks in NLP settings and fine-tuned our hyper-parameters using our validation data. We use an Adam optimizer [23]; we initialize the learning rate to $10^{-3}$ and reduce it every other epoch until it reaches $10^{-4}$, where it remains stable; we set momentum to $1/e$ after the first 1,000 minibatches and clip total gradients per sample to 15. Validation error is computed at every epoch and we select the model when this error stabilizes; for both of our RNN models, this occurred around epoch 5.

TSC + CheckJS. In the second experiment, we compare our deep learning models against those types that the TypeScript compiler (tsc) could infer (after removing type annotations), when also equipped with a static type-inference tool for JavaScript named CheckJS.9 CheckJS reads JavaScript and provides best-effort type inference, assigning any to those tokens to which it cannot assign a more precise type. Since TSC+CheckJS (hereafter typically abbreviated “CheckJS”) has access to complete compiler and build information of the test projects (while DeepTyper is evaluated in an inter-project setting), our main aim is not to outperform CheckJS but rather to demonstrate how probabilistic type inference can complement CheckJS by providing plausible, verifiable recommendations precisely where the compiler is uncertain.

9 See https://github.com/Microsoft/TypeScript/wiki/Type-Checking-JavaScript-Files

JSNice. In our final experiment, we compare the deep learner’s performance with that of JSNice [28]. JSNice was proposed as a method to (among other things) learn type annotations for JavaScript from dependencies between variables, so we thought it instructive to compare and contrast performances. A perfect comparison is not possible as JSNice differs from our work in several fundamental ways: (1) it focuses on JavaScript code only whereas our model is trained on TypeScript code with a varying degree of similarity to plain JavaScript, (2) it assigns a limited set of types, including number, string, Object, Array, a few special cases of Object such as Element and Document, and ? (unsure), and (3) it requires compiler information (e.g. dependencies, scoping), whereas our approach requires just an aligned corpus and is otherwise language-agnostic.

4.4 Metrics

We evaluate our models on the accuracy and consistency of their predictions. Since a prediction is made at each identifier’s occurrence, we first evaluate each occurrence separately. We measure the rank of the correct prediction and extract top-K accuracy metrics. We evaluate the various models’ performances on real-world type annotations (the GOLD data). Unless otherwise stated, we only focus on suggesting the non-any types in our aligned datasets, since inferring any is generally not helpful. The RNN also emits a probability with its top prediction, which can be used to reflect its “confidence” at that location. This can be used to set a minimum confidence threshold, below which DeepTyper’s suggestions are not presented. Thus, we also show precision/recall results when varying this confidence threshold for DeepTyper. Finally, we are interested in how consistent the model is in its assignment of types to identifiers across their definition and usages in the code. Let $X$ be the set of all type-able identifiers that occur more than once in some code of interest. For $DT : X \to \mathbb{N}$, let $DT(x)$ denote the number of types DeepTyper assigns to $x$, across all of its appearances. Ideally, $\forall x \in X,\ DT(x) = 1$; indeed, this is a constraint that standard type inference obeys. Like all stochastic approaches, DeepTyper is not so precise. Let $Y \triangleq \{x \in X \mid DT(x) > 1\}$. Then the type inconsistency measure of a type inference approach, like DeepTyper, that does not necessarily find the principal type of a variable across all of its uses, is $|Y| / |X|$.

4.5 Experimental Setup

The deep learning code was written in CNTK [4]. All experiments are conducted on an NVIDIA GeForce GTX 1080 Ti GPU with 11GB of graphics memory, in combination with a 6-core Intel i7-8700 CPU with 32GB of RAM. Our resulting model requires ca. 500MB of RAM to be loaded into memory and can be run on both a GPU and a CPU. It can provide type annotations for (reasonably sized) files in well under 2 seconds.

Our algorithm is programmatically exposed through a web API (Figure 4) that allows users to submit JavaScript or TypeScript files and annotates each identifier with its most likely types, subject to a
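The per-occurrence and consistency metrics of Section 4.4 can be sketched as follows. The data layout (a ranked list of candidate types per occurrence, and a mapping from each identifier to the types predicted at its occurrences), the function names and the example values are assumptions for illustration; the paper does not specify them.

```python
def top_k_accuracy(ranked_predictions, gold_types, k):
    """Fraction of occurrences whose correct type is among the k highest-ranked predictions."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_predictions, gold_types))
    return hits / len(gold_types)


def type_inconsistency(assignments):
    """|Y| / |X|: share of repeated identifiers that receive more than one distinct type.

    `assignments` maps each identifier to the list of types predicted at its occurrences.
    """
    X = {name for name, types in assignments.items() if len(types) > 1}   # occur more than once
    Y = {name for name in X if len(set(assignments[name])) > 1}           # DT(x) > 1
    return len(Y) / len(X) if X else 0.0


# Hypothetical predictions for three annotated occurrences (best guess first).
ranked = [["number", "any", "string"], ["string", "number"], ["any", "HTMLElement"]]
gold = ["number", "number", "HTMLElement"]
top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)

# Hypothetical per-identifier predictions across a file.
assignments = {
    "p": ["HTMLElement", "HTMLElement"],        # consistent across occurrences
    "count": ["number", "string", "number"],    # inconsistent: two distinct types
    "once": ["string"],                         # single occurrence, so not a member of X
}
inconsistency = type_inconsistency(assignments)
print(top1, top2, inconsistency)
```

In this toy input only one of three top-ranked guesses is correct, all three correct types appear within the top two, and one of the two repeated identifiers is assigned conflicting types.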
or all types can be overruled by DeepTyper, and the minimum confidence for DeepTyper to act. Results for CJ and DT by themselves are shown independent of threshold for clarity (and are thus identical in their columns).
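The effect of a minimum confidence threshold can be illustrated with a small sketch. It assumes each prediction carries the probability of its top-ranked type, withholds suggestions below the threshold, and uses one common definition of the two measures: precision over the suggestions that are shown, recall over all annotated locations. The tuple layout, the function name and the example values are hypothetical and are not results from the paper.

```python
def precision_recall(predictions, threshold):
    """predictions: list of (predicted_type, confidence, gold_type) triples."""
    shown = [(pred, gold) for pred, conf, gold in predictions if conf >= threshold]
    correct = sum(pred == gold for pred, gold in shown)
    precision = correct / len(shown) if shown else 1.0   # convention when nothing is suggested
    recall = correct / len(predictions) if predictions else 0.0
    return precision, recall


# Made-up predictions for four annotated locations.
predictions = [
    ("string", 0.95, "string"),
    ("number", 0.80, "number"),
    ("string", 0.60, "number"),
    ("boolean", 0.30, "string"),
]
results = {t: precision_recall(predictions, t) for t in (0.0, 0.5, 0.9)}
for threshold, (p, r) in results.items():
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision: fewer suggestions are shown, but a larger share of them is correct.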
Table 5: Comparison of DeepTyper, JSNice, and Hybrid of no additional incorrect or partially correct annotations. At the 0%
both across thirty randomly selected JavaScript functions. threshold, the Hybrid model is more than 15% points more likely
to be correct than either model separately, while introducing fewer
Correct Partial Incorrect Unsure errors than DeepTyper would by itself.
Qualitatively, we find that DeepTyper particularly outperforms
JSNice [28] 51.9% 1.9% 0.9% 45.4% JSNice when the type is intuitively clear from the context, such
DeepTyper ≥ 0% 55.6% 2.8% 6.5% 35.2% as for cssText in Figure 1. It expresses high confidence (and
DeepTyper ≥ 50% 51.9% 0.9% 2.8% 44.4% corresponding accuracy) in tokens who’s name include cues to-
DeepTyper ≥ 90% 35.2% 0.0% 0.0% 64.8% wards their type (e.g. “name” for string) and/or are used in id-
Hybrid ≥ 0% 71.3% 3.7% 4.6% 20.4% iomatic ways (e.g. concatenation with another string, or invok-
Hybrid ≥ 50% 70.4% 1.9% 1.9% 25.9% ing element-related methods on HTMLElement-related types).
Hybrid ≥ 90% 64.8% 1.9% 0.9% 32.4% JSNice often declares uncertainty on these because of a possibly
ambiguous type (e.g. string concatenation does not imply that the
their body. Thus each function requires two type annotations at right-hand argument is a string, and other classes may have
the very least. Because the evaluation had to be performed manu- declared similarly named methods). Vice versa, when JSNice does
ally, we examined thirty JavaScript functions.11 For each function infer a type, it is very precise: whereas DeepTyper often gravitates
we manually determined the correct types to use as an oracle for to a subtype or supertype (especially any, if a variable is used in
evaluation and comparison, assigning any if no conclusive type several far-apart places) of the true type, JSNice was highly accu-
could be assigned. As a result, we identified 167 annotations (on rate when it did not declare uncertainty and was able to include
function return types, local variables, parameters and attributes) of information (such as dataflow connections) from across the whole
which 108 were clearly not any types. function, regardless of size. Altogether, our results demonstrate
Again, we focus on predicting only the non-any types, since that these two methods excel at different locations, with JSNice
these are most likely to be helpful to the user. Cases in which JSNice benefiting from its access to global information and DeepTyper
predicted ? or Object, and cases where DeepTyper predicted any from its powerful learned intuition.
or was not sufficiently confident are all treated as “Unsure”. We
again show results for various confidence thresholds for DeepTyper
(across a slightly lower range than before, to better match the
6 DISCUSSION
“Unsure” rate of JSNice) and include another hybrid model, in which 6.1 Learning Type Inference
DeepTyper may attempt to “correct” any cases in which both JSNice Type inference is traditionally an exact task, and for good reason:
was uncertain (or did not annotate a type at all) and DeepTyper is unsound type inference risks breaking valid code, violating the
sufficiently confident. The results are shown in Table 5. central law of compiler design. However, sound type inference
At the lowest threshold, DeepTyper gets both more types correct for some programming languages can be greatly encumbered by
and wrong than JSNice, whereas at the highest threshold it makes features of the language design (such as eval() in JS). Although the
no mistakes at all while still annotating more than one-third of TypeScript compiler with CheckJS achieved good accuracy in our
the types correctly. JSNice made one mistake, in which it assigned experiments in which it had access to the full project, it could still
a type that was too specific given the context.12 We also count be improved substantially by probabilistic methods, particularly at
“partial” correctness, in which the type given was too specific, but the many places where it only inferred any. With partial typing
close to the correct type. This includes cases in which both JSNice now an option in languages such as TypeScript and Python, there
and DeepTyper suggest HTMLElement instead of Element.

Overall, DeepTyper’s and JSNice’s performances are very similar on this task, despite DeepTyper having been trained primarily on TypeScript code, using a larger type vocabulary and not requiring any information about the snippet beyond its tokens. The two approaches are also remarkably complementary. JSNice is almost never incorrect when it does provide a type, but it is more often uncertain, not providing anything, whereas DeepTyper makes more predictions, but is incorrect more often than JSNice. A Hybrid approach in which JSNice is first queried and DeepTyper is used if JSNice cannot provide a type shows a dramatic improvement over each approach in isolation and demonstrates that JSNice and DeepTyper work well in differing contexts and for differing types. When using a 90% confidence threshold, the Hybrid model boosts the accuracy by 12.9 percentage points (51.9% to 64.8%) while introducing

11 The source of these functions and the functions themselves will be released after anonymity is lifted.
12 We found several more such cases among variables whose true type was deemed any and are thus not included in this table.

is a need for type suggestion engines that can assist programmers in enriching their code with type information, preferably in a semi-automatic way.

A key insight of our work is that type inference can be learned from an aligned corpus of tokens and their types, and such an aligned corpus can be obtained fully automatically from existing data. This is similar to recent work by Vasilescu et al., who use a JavaScript obfuscator to create an aligned corpus of real-world code and its obfuscated counterpart, which can then be reversed to learn to de-obfuscate [33], although they did not approach this as a sequence tagging problem. This type of aligned corpus (e.g. text annotated with parse tags, named entities) is often a costly resource in natural language processing, requiring substantial manual effort, but comes all but free in many software-related tasks, primarily because they involve formal languages for which interpreters and compilers exist. As a result, vast amounts of training data can be made available for tasks such as these, to the great benefit of models such as the deep learners we used.
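To make the notion of an aligned corpus of tokens and their types concrete, the following is a toy sketch of how annotated code could be turned into two parallel sequences. It is an illustration under strong assumptions, not the pipeline used in this work: the input is assumed to be pre-tokenized, an annotation is assumed to be exactly the three-token pattern `name : Type`, and the `NO_TYPE` marker and the function name are hypothetical.

```typescript
// Toy sketch: derive an aligned (token, type) corpus from annotated tokens.
// Assumptions (not from the paper): input is already tokenized, and an
// annotation is always the three-token pattern `name`, `:`, `Type`.
// Ternaries and object literals would be misread; a real pipeline would
// presumably rely on a compiler rather than on this pattern.

const NO_TYPE = "O"; // hypothetical marker for tokens that carry no type

interface AlignedExample {
  tokens: string[]; // annotation-free, JavaScript-like token sequence
  types: string[];  // one type label per token, same length as `tokens`
}

function alignAnnotatedTokens(annotated: string[]): AlignedExample {
  const tokens: string[] = [];
  const types: string[] = [];
  let i = 0;
  while (i < annotated.length) {
    // A token is annotated if it is followed by ":" and a type token.
    const isAnnotated = annotated[i + 1] === ":" && i + 2 < annotated.length;
    tokens.push(annotated[i]);
    if (isAnnotated) {
      types.push(annotated[i + 2]);
      i += 3; // skip the ":" and the type token
    } else {
      types.push(NO_TYPE);
      i += 1;
    }
  }
  return { tokens, types };
}

const example = alignAnnotatedTokens(
  ["let", "x", ":", "number", "=", "0", ";"]
);
console.log(example.tokens.join(" ")); // let x = 0 ;
console.log(example.types.join(" "));  // O number O O O
```

The two output sequences have equal length, which is what allows the task to be treated as sequence tagging: each token on the input side has exactly one label on the output side.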
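The Hybrid strategy from the JSNice comparison (query JSNice first, and use DeepTyper only when JSNice provides no type) can be sketched as a small decision function. This is a minimal sketch under assumptions: the interfaces and the function name are hypothetical and do not correspond to any real JSNice or DeepTyper API, and the 90% confidence threshold is assumed to apply to DeepTyper's top-ranked prediction.

```typescript
// Sketch of the Hybrid strategy: ask JSNice first, fall back to DeepTyper.
// All names and shapes below are illustrative assumptions, not real APIs.

interface Prediction {
  type: string;
  confidence: number; // probability assigned to `type`, in [0, 1]
}

type Source = "jsnice" | "deeptyper";

interface HybridResult {
  type: string;
  source: Source;
}

function hybridPredict(
  jsniceType: string | undefined, // undefined: JSNice gave no type
  deepTyper: Prediction,          // DeepTyper's top-ranked suggestion
  threshold: number = 0.9         // assumed to be the 90% confidence cut-off
): HybridResult | undefined {
  // JSNice is rarely wrong when it answers, so its answer takes priority.
  if (jsniceType !== undefined) {
    return { type: jsniceType, source: "jsnice" };
  }
  // Otherwise accept DeepTyper's suggestion only if it is confident enough.
  if (deepTyper.confidence >= threshold) {
    return { type: deepTyper.type, source: "deeptyper" };
  }
  return undefined; // neither tool offers a usable type: abstain
}
```

Giving JSNice priority reflects the observation that it is almost never incorrect when it does provide a type, while the threshold limits how many of DeepTyper's less certain guesses are let through.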
6.2 Strengths and Weaknesses of the RNN

We have shown that RNN-based models can learn a remarkably powerful probabilistic notion of types through differentiable type vectors. This probabilistic perspective on types is a necessity for training these models and raises an interesting challenge: the models can deliver a highly accurate source of type guesses while being unable to make any guarantees regarding the soundness of even their most confident annotations. For instance, if the RNN sees the phrase “var x = 0”, it may deem the (clearly correct) type ‘number’ for ‘x’ highly probable, but not certain (i.e. assign it a probability very close to, but not exactly, 1). A hybrid approach provides a solution: when DeepTyper offers plausible and natural type annotation suggestions, the type checker can verify these, thus preserving soundness, similar to how a programmer might annotate code. It is also interesting to ask if we can teach deep learning algorithms some of these abilities. Provable correctness is not out of the scope of these models, as was demonstrated for neural program inference using recursion [11].

DeepTyper’s probabilistic nature also leads to an intriguing kind of “type drift”, also visible in our web tool, in which the probabilities in a variable’s type vector change throughout its definition and use in the code, even though its true type is fixed. We partially mitigated this limited awareness of the global accuracy of its type assignments by equipping the model with information that is lexically far away and saw gains in consistency and performance. Still, a substantial number of consistency errors remain, allowing room for improvement over the deep learners used in this work if global and local optimization can be balanced. Such a combination need not come from deep learning alone: the global model may be a symbiosis with a static type checker, or a method such as conditional random fields [28].

6.3 Extensions

The aligned corpus in our work is one between TypeScript code and the types for each identifier in this code. As such, our work only scratches the surface of what this free Rosetta Stone could give! Type inference is only one step in the compilation process and many other parts of TypeScript’s enhancements over JavaScript could be learned, including type definitions, classes, public/private modifiers, etc. Even fully transpiling TypeScript to JavaScript can be used to create an aligned corpus (although no longer token-aligned, and with a fair degree of boiler-plate code) that we may, in due time, be able to exploit to learn to convert entire files. This methodology is not bound to our current language either; an obvious extension is to partially typed Python code, but similar tasks in many languages (e.g. inferring nullity) may well be highly amenable to a comparable approach.

7 RELATED WORK

Type inference is a widely studied problem in programming language research. Inferring types for dynamic languages has become an important research area in light of the widespread use of languages such as JavaScript and Python, and recent moves to allow partial typing of these [6, 13, 34].

Probabilistic type inference, i.e. the use of probabilistic reasoning for inferring types, has received recent attention. JSNice [28] infers primitive types of JavaScript code by learning from a corpus. JSNice builds a dependency network among variables and learns statistical correlations that predict the type. In contrast to this work, our deep learner considers a much wider context than is defined by JSNice’s dependency network and aims to predict a larger set of type annotations. The work of Xu et al. [34] uses probabilistic inference for Python and defines a probabilistic method for fusing information from multiple sources such as attribute accesses and variable names. However, this work does not include a learning component but rather uses a set of hand-picked weights on probabilistic constraints. Both these works rely on factor graphs for type inference, while, in this work, we avoid the task of explicitly building such a graph by directly exploiting the interaction of a strong deep neural network and a pre-existing type checker.

Applying machine learning to source code is not a new idea. Hindle et al. [18] learned a simple n-gram language model of code to assist code completion. Raychev et al. [28] developed a probabilistic framework for predicting program properties, such as types or variable names. Other applications include deobfuscation [10], coding conventions [7, 28] and migration [21, 25]. Vasilescu et al. specifically apply machine learning to an aligned corpus within the same language, using an obfuscator to learn de-obfuscation of JavaScript [33]. Their work is closely related to ours, although our approach works within TypeScript and can also enhance JavaScript code into TypeScript code because the latter is a superset of the former. Furthermore, our work learns to translate information between domains: from tokens to their types, whereas de-obfuscation is only concerned with translation between identifiers.

8 CONCLUSION

Our work set out to study to what extent type annotations can be learned from the underlying code and whether such learners can assist programmers to lower the annotation tax. Our results are positive: we showed that deep learners can achieve a strong, probabilistic notion of types given code that extends across projects and to both TypeScript and plain JavaScript code. We also highlight their present flaws and hope to inspire research into further improvements. Even more promising is that DeepTyper proved to be complementary to a compiler’s type inference engine on an annotation task, even when the latter had access to complete build information. Jointly, they could predict thousands of annotations with high precision. Our tool is also complementary with JSNice [28] on plain JavaScript functions, which shows that our model is learning new, different type information from prior work. Our findings demonstrate potential for learning traditional software engineering tasks, type inference specifically, from aligned corpora.

ACKNOWLEDGEMENTS

Vincent Hellendoorn was partially supported by the National Science Foundation, award number 1414172.

REFERENCES

[1] [n. d.]. flow. https://flow.org/.
[2] [n. d.]. mypy. http://mypy-lang.org/.
[3] [n. d.]. Spectrum IEEE 2017 Top Programming Languages. http://spectrum.ieee.org/computing/software/the-2017-top-programming-languages.
[4] [n. d.]. The Microsoft Cognitive Toolkit. https://www.microsoft.com/en-us/cognitive-toolkit/.
[5] [n. d.]. TypeScript. https://www.typescriptlang.org/.
[6] Martin Abadi, Cormac Flanagan, and Stephen N Freund. 2006. Types for safe locking: Static race detection for Java. ACM Transactions on Programming Languages and Systems (TOPLAS) 28, 2 (2006), 207–255.
[7] Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2014. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 281–293.
[8] Esben Andreasen and Anders Møller. 2014. Determinacy in static analysis for jQuery. In ACM SIGPLAN Notices, Vol. 49. ACM, 17–31.
[9] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
[10] Benjamin Bichsel, Veselin Raychev, Petar Tsankov, and Martin Vechev. 2016. Statistical deobfuscation of Android applications. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 343–355.
[11] Jonathon Cai, Richard Shin, and Dawn Song. 2017. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611 (2017).
[12] Franck Dernoncourt, Ji Young Lee, and Peter Szolovits. 2017. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. arXiv preprint arXiv:1705.05487 (2017).
[13] Michael Furr, Jong-hoon David An, Jeffrey S Foster, and Michael Hicks. 2009. Static type inference for Ruby. In Proceedings of the 2009 ACM symposium on Applied Computing. ACM, 1859–1866.
[14] Zheng Gao, Christian Bird, and Earl T. Barr. 2017. To Type or not to Type: On the Effectiveness of Static typing for JavaScript. In Proceedings of the 39th International Conference on Software Engineering. IEEE.
[15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[16] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning. ACM, 369–376.
[17] Stefan Hanenberg, Sebastian Kleinschmager, Romain Robbes, Éric Tanter, and Andreas Stefik. 2014. An empirical study on the impact of static typing on software maintainability. Empirical Software Engineering 19, 5 (01 Oct 2014), 1335–1382. https://doi.org/10.1007/s10664-013-9289-1
[18] Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 837–847.
[19] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[20] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
[21] Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. 2014. Phrase-based statistical translation of programming languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software. ACM, 173–184.
[22] Vineeth Kashyap, Kyle Dewey, Ethan A Kuefner, John Wagner, Kevin Gibbons, John Sarracino, Ben Wiedermann, and Ben Hardekopf. 2014. JSAI: A static analysis platform for JavaScript. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 121–132.
[23] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[24] Leo A Meyerovich and Ariel S Rabkin. 2012. Socio-PLT: Principles for programming language adoption. In Proceedings of the ACM international symposium on New ideas, new paradigms, and reflections on programming and software. ACM, 39–54.
[25] Anh Tuan Nguyen, Hoan Anh Nguyen, Tung Thanh Nguyen, and Tien N Nguyen. 2014. Statistical learning approach for mining api usage mappings for code migration. In Proceedings of the 29th ACM/IEEE international conference on Automated software engineering. ACM, 457–468.
[26] Chris Olah and Shan Carter. 2016. Attention and Augmented Recurrent Neural Networks. Distill (2016). https://doi.org/10.23915/distill.00001
[27] Baishakhi Ray, Daryl Posnett, Vladimir Filkov, and Premkumar Devanbu. 2014. A large scale study of programming languages and code quality in github. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 155–165.
[28] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting program properties from big code. In ACM SIGPLAN Notices, Vol. 50. ACM, 111–124.
[29] Koushik Sen, Swaroop Kalasapur, Tasneem Brutch, and Simon Gibbs. 2013. Jalangi: A selective record-replay and dynamic analysis framework for JavaScript. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 488–498.
[30] Jeremy G Siek, Michael M Vitousek, Matteo Cimini, and John Tang Boyland. 2015. Refined criteria for gradual typing. In LIPIcs-Leibniz International Proceedings in Informatics, Vol. 32. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[31] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[32] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
[33] Bogdan Vasilescu, Casey Casalnuovo, and Premkumar Devanbu. 2017. Recovering clear, natural identifiers from obfuscated JS names. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 683–693.
[34] Zhaogui Xu, Xiangyu Zhang, Lin Chen, Kexin Pei, and Baowen Xu. 2016. Python probabilistic type inference with natural language support. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 607–618.