[FSE18] Deep Learning Type Inference

The document presents DeepTyper, a deep learning model designed for type inference in dynamically typed languages like JavaScript and Python. It leverages an aligned corpus of code and types to suggest type annotations, achieving over 80% precision in its recommendations. The work highlights the challenges of static typing and the benefits of using deep learning to enhance type inference, particularly in the context of transitioning from JavaScript to TypeScript.

Deep Learning Type Inference

Vincent J. Hellendoorn∗ Christian Bird


University of California, Davis Microsoft Research
Davis, California, USA Redmond, Washington, USA
[email protected] [email protected]

Earl T. Barr Miltiadis Allamanis


University College London Microsoft Research Cambridge
London, UK Cambridge, UK
[email protected] [email protected]
ABSTRACT
Dynamically typed languages such as JavaScript and Python are increasingly popular, yet static typing has not been totally eclipsed: Python now supports type annotations and languages like TypeScript offer a middle-ground for JavaScript: a strict superset of JavaScript, to which it transpiles, coupled with a type system that permits partially typed programs. However, static typing has a cost: adding annotations, reading the added syntax, and wrestling with the type system to fix type errors. Type inference can ease the transition to more statically typed code and unlock the benefits of richer compile-time information, but is limited in languages like JavaScript as it cannot soundly handle duck-typing or runtime evaluation via eval. We propose DeepTyper, a deep learning model that understands which types naturally occur in certain contexts and relations and can provide type suggestions, which can often be verified by the type checker, even if it could not infer the type initially. DeepTyper leverages an automatically aligned corpus of tokens and types to accurately predict thousands of variable and function type annotations. Furthermore, we demonstrate that context is key in accurately assigning these types and introduce a technique to reduce overfitting on local cues while highlighting the need for further improvements. Finally, we show that our model can interact with a compiler to provide more than 4,000 additional type annotations with over 95% precision that could not be inferred without the aid of DeepTyper.

CCS CONCEPTS
• Software and its engineering → Software notations and tools; Automated static analysis; • Theory of computation → Type structures;

KEYWORDS
Type Inference, Deep Learning, Naturalness

∗ Work partially completed while author was an intern at Microsoft Research

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ESEC/FSE ’18, November 4–9, 2018, Lake Buena Vista, FL, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5573-5/18/11. . . $15.00
https://doi.org/10.1145/3236024.3236051

ACM Reference Format:
Vincent J. Hellendoorn, Christian Bird, Earl T. Barr, and Miltiadis Allamanis. 2018. Deep Learning Type Inference. In Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’18), November 4–9, 2018, Lake Buena Vista, FL, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3236024.3236051

1 INTRODUCTION
Programming language use in real-world software engineering varies widely and the choice of a language often comes with strong beliefs about its design and quality [24]. In turn, the academic community has devoted increasing attention to evaluating the practical impact of important design decisions like the strength of the type system and the trade-off between static/compile-time and dynamic/run-time type evaluation. The evidence suggests that static typing is useful: Hanenberg et al. showed in a large-scale user study that statically typed languages enhance maintainability and readability of undocumented code and the ability to fix type and semantic errors [17], Gao et al. found that having type annotations in JavaScript could have avoided 15% of reported bugs [14], and Ray et al. empirically found a modestly lower fault incidence in statically typed functional languages in open-source projects [27].
At the same time, some of the most popular programming languages are dynamically, relatively weakly typed: Python, propelled by interest in deep learning, has risen to the top of the IEEE Spectrum rankings [3]; JavaScript (JS) has steadily increased its foothold both in and out of web-development, for reasons including the comprehensive package ecosystem of NodeJS. Achieving the benefits of typing for languages like JS is the subject of much research [8, 22]. It is often accomplished through dynamic analysis (such as Jalangi [29]), as static type inference for these languages is made complex by features such as duck-typing and JS’s eval().
Several languages, including TypeScript (TS), have been developed that propose an alternative solution: they enhance an existing language with a type system that allows partial typing (allowing, but not requiring, all variables to have type annotations), which can be transpiled back to the original language. In this way, TS can be used and compiled in the IDE, with all the benefits of typing, and can be transpiled into “plain” JS so that it can be used anywhere regular JS can. This lowers the threshold for typing existing code while unlocking (at least partially) the benefits of compile-time type checking.


Figure 1: Loosely aligned example of JavaScript code and the TypeScript equivalent with corresponding type annotations.

While developers may benefit from typed dialects of JS such as TS, the migration path from JS to TS is challenging as it requires annotating existing codebases, which may be large. This is a time-consuming and potentially error-prone process. Fortunately, the growth of TS’ popularity in the open-source ecosystem gives us the unique opportunity to learn type inference from real-world data: it offers a dataset of JS-like code with type annotations, which can be converted into an aligned training corpus of code and its corresponding types. We use this data to train DeepTyper, which uses deep learning on existing typed code to infer new type annotations for JS and TS. It learns to annotate all identifiers with a type vector: a probability distribution over types, which we use to propose types that a verifier can check and relay to a user as type suggestions.
In this work, we demonstrate the general applicability of deep learning to this task: it enriches conventional type inference with a powerful intuition based on both names and (potentially extensive) context, while also identifying the challenges that need to be addressed in further research, mainly: established models (in our case, deep recurrent neural networks) struggle to carry dependencies, and thus stay consistent, across the length of a function. When trained, DeepTyper can infer types for identifiers that the compiler’s type inference cannot establish, which we demonstrate by replicating real-world type annotations in a large body of code. DeepTyper suggests type annotations with over 80% precision at recall values of 50%, often providing either the correct type or at least narrowing down the potential types to a small set. Our contributions are three-fold:
(1) A learning mechanism for type inference using an aligned corpus and differentiable type vectors that relax the task of discrete type assignment to a continuous, real-valued vector function.
(2) Demonstration of both the potential and challenges of using deep learning for type inference, particularly with a proposed enhancement to existing RNNs that increases consistency and accuracy of type annotations across a file.
(3) A symbiosis between a probabilistic learner and a sound type inference engine that mutually enhances performance. We also demonstrate a mutually beneficial symbiosis with JSNice [28], which tackles a similar problem.

2 PROBLEM STATEMENT
A developer editing a file typically interacts with various kinds of identifiers, such as names of functions, parameters and variables. Each of these lives in a type system, which constrains operations to take only operands on which they are defined. Knowledge of the type at compile-time can improve the code’s performance and allow early detection of faults [14, 17, 27]. A stronger type system is also useful for software development tools, for instance improving auto-completion accuracy and debugging information. Although virtually every programming language has a type system, type information can be difficult to infer at compile-time without type annotations in the code itself. As a result, dynamically typed languages such as JavaScript (JS) and Python are often at a disadvantage. At the same time, using type annotations comes with a type annotation tax [14], paid when adding type annotations, navigating around them while reading code, and wrestling with type errors. Perhaps for these reasons, developers voted with their keyboards at the beginning of the 21st century and increasingly turned to dynamically typed languages, like Python and JS.

2.1 A Middle Ground
Type annotations have not disappeared, however: adding (partial) type annotations to dynamically typed programming languages has become common in modern software engineering. Python 3.x introduced type hints via its typing module, which is now widely used, notably by the mypy static checker [2]. For JS, multiple solutions exist, including Flow [1] and TypeScript [5]. Two of the largest software houses, Facebook and Microsoft, have invested heavily in these two offerings, which is a testament to the value industry is now placing on returning to languages that provide type-backed assurances. These new languages differ from their predecessors: to extend and integrate with dynamic languages, their type systems permit programs to be partially annotated,¹ not in the sense that some of the annotations can be missing because they can be inferred, but in the sense that, for some identifiers, the correct annotation is unknown. When an identifier’s principal type is unknown, these type systems annotate that identifier with an implicit any, reflecting a lack of knowledge of the identifier’s type. Their support for partial typing makes them highly deployable in JS shops, but taking full advantage of them still requires paying the annotation tax, and slowly replacing anys with type annotations.

¹ We avoid calling these type systems gradual because they violate the clause of the gradual guarantee [30] that requires them to enforce type invariants, beyond those they get “for free” from JS’ dynamic type system, at runtime.

One of these languages is TypeScript (TS): a statically typed superset of JS that transpiles to JS, allowing it to be used as a drop-in replacement for JS. In TS, the type system includes primitive types (e.g. number, string), user-defined types (e.g. Promise, HTMLElement), combinations of these, and any. TS comes with compile-time type inference, which yields some of the benefits of a static type system but is fundamentally limited in what it can soundly infer due to JS features like duck-typing. Consider
the JS code on the left-hand side of Figure 1: the type for p may be inferred from the call to createElement, which returns an HTMLElement.² On the other hand, the type of cssText is almost certainly string, but this cannot soundly be inferred from its usage here.³ For such identifiers, developers would need to add type annotations, as shown in the TS code on the right.

² Provided the type of ownerDocument is Document, which may itself require an annotation.
³ Grammatically, it could e.g. be number.

2.2 Type Suggestion
To the developer wishing to transition from the code on the left to that on the right in Figure 1, a tool that can recommend accurate type annotations, especially where traditional type inference fails, would be helpful. This type suggestion task of easing the transition from a partially to a fully typed code-base is the goal of our work. We distinguish two objectives for type suggestion:
(1) Closed-world type suggestion recommends annotations to the developer from some finite vocabulary of types, e.g. to add to declarations of functions or variables.
(2) Open-world type suggestion aims to suggest novel types to construct that reflect computations in the developer’s code.
As a first step to assisting developers in annotating their code, we restrict ourselves to the first task and leave the second to future work. Specifically, our goal is to learn to recommend the (ca. 11,000) most common types from a large corpus of code, including those shown in Figure 1. To achieve this, we view the type inference problem as a translation problem between un-annotated JS/TS and annotated TS. We chose to base our work on TS because, as a superset of JS, it is designed to displace JS in developers’ IDEs. Thus, a growing body of projects have already adopted it (including well known projects such as Angular and Reactive Extensions) and we can leverage their code to train DeepTyper. We can use TS’ compiler to automatically generate training data consisting of pairs of TS without type annotations and the corresponding types for DeepTyper’s training.
Here, the fact that we are translating between two such closely related languages is a strength of our approach, easing the alignment problem [9, 16] and vastly reducing the search space our models must traverse.⁴ We train our translator, DeepTyper, on TS inputs (Figure 1, right), then test it on JS (left) or partially annotated TS files. DeepTyper suggests variable and function annotations, consisting of return, local variable, and parameter types.

⁴ Vasilescu et al.’s work on deobfuscating JS also successfully leverages machine translation between two closely related languages [33].

3 METHOD
To approach type inference with machine learning, we are inspired by existing natural language processing (NLP) tasks, such as part-of-speech (POS) tagging and named entity recognition (NER) [12, 20]. In those tasks, a machine learning model needs to infer the role of a given word from its context. For example, the word “mail” can be either a verb or a noun when viewed in isolation, but when given context in the sentence “I will mail the letter tomorrow”, the part of speech becomes apparent. To solve this ambiguity, NLP research has focused on probabilistic methods that learn from data. Tasks like these are amenable to sequence-to-sequence models, in which a sequence of tokens is transformed into a sequence of types (in our case) [32]. Specifically, our task is a sequence of annotation tasks, in which all elements $s_t$ in an input sequence $s_1 \ldots s_N$ need to be annotated. Therefore, when approaching this problem with probabilistic machine learning, the modeled probability distribution is $P(\tau_1 \ldots \tau_N \mid s_1 \ldots s_N)$, where $\tau_i$ represents the type annotation of $s_i$. In our case, the annotations are the types for the tokens in the input, where we align tokens that have no type (e.g. punctuation, keywords) with a special no-type symbol.
Although deriving type annotations has many similarities to POS tagging and NER, it also presents some unique characteristics. First, our task has a much larger set of possible type annotations. The widely used Penn Treebank Project uses only 36 distinct parts-of-speech tags for all English words, while we aim to predict more than 11,000 types (Section 4.2). Furthermore, NLP tasks annotate a single instance of a word, whereas we are interested in annotating a variable that may be used multiple times, and the annotations ought to be consistent across occurrences.

3.1 A Neural Architecture for Type Inference
Similar to recent models in NLP, we turn to deep neural networks for our type inference task. Recurrent Neural Networks (RNN) [9, 15, 19] have been widely successful at many natural language annotation tasks such as named entity recognition [12] and machine translation [9]. RNNs are neural networks that work on sequences of elements, such as words, making them a natural fit for our task. The general family of RNNs is defined over a sequence of elements $s_1 \ldots s_N$ as $h_t = \mathrm{RNN}(x_{s_t}, h_{t-1})$, where $x_{s_t}$ is a learned representation (embedding) of the input element $s_t$ and $h_{t-1}$ is the previous output state of the RNN. The initial state $h_0$ is usually set to a null vector ($\mathbf{0}$). Both $x$ and $h$ are high dimensional vectors, whose dimensionality is tunable: higher dimensions allow the model to capture more information, but also increase the cost of training and may lead to overfitting.
As we feed input tokens to the network in order, the vector $x$ for each token is its representation, while $h$ is the output state of the RNN based on both this current input and its previous state. Thus, RNNs can be seen as networks that learn to “summarize” the input sequence $s_1 \ldots s_t$ with $h_t$. There are many different implementations of RNNs; in this work, we use GRUs (Gated Recurrent Unit) [9]. For a more extensive discussion of RNNs, we refer the reader to Goodfellow et al. [15].
In general translation tasks (e.g. English to French), the length and ordering of words in the input and output sequence may be different. RNN-based translation models account for these changes by first completely digesting the input sequence, then using their final state (typically plus some attention mechanism [26]) to construct the output sequence, token by token. In our case, however, the token and type sequences are perfectly aligned, allowing us to treat our suggestion task as a sequence annotation task, as is also done for POS tagging and NER. In this setting, for every input token that we provide to the RNN, we also expect an output type judgement. Since the RNN does not have to digest the full input before making type judgements, using this precise alignment can yield better performance.
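The recurrence $h_t = \mathrm{RNN}(x_{s_t}, h_{t-1})$ and the one-output-per-token setting described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class `GRUCell`, the function `tag_states`, the toy dimensions and the random (untrained) weights are all assumptions chosen for illustration, and the gate formulation follows one common GRU convention.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class GRUCell:
    """A single GRU cell (hypothetical helper); weights are random, not trained."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        shape = (hidden_dim, input_dim + hidden_dim)
        # One weight matrix and one bias per gate, each acting on [x; h].
        self.W_z, self.W_r, self.W_h = (rng.normal(0.0, 0.1, shape) for _ in range(3))
        self.b_z, self.b_r, self.b_h = (np.zeros(hidden_dim) for _ in range(3))

    def step(self, x, h_prev):
        """Compute h_t from the embedding x of s_t and the previous state h_{t-1}."""
        xh = np.concatenate([x, h_prev])
        z = sigmoid(self.W_z @ xh + self.b_z)  # update gate
        r = sigmoid(self.W_r @ xh + self.b_r)  # reset gate
        candidate = np.tanh(self.W_h @ np.concatenate([x, r * h_prev]) + self.b_h)
        return (1.0 - z) * h_prev + z * candidate


def tag_states(token_ids, embeddings, cell):
    """Return one hidden state per input token, as in sequence annotation."""
    h = np.zeros(cell.hidden_dim)  # h_0 is the null vector
    states = []
    for tok in token_ids:
        h = cell.step(embeddings[tok], h)  # h_t summarizes s_1 ... s_t
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))  # toy vocabulary: 6 tokens, 4-dim embeddings
cell = GRUCell(input_dim=4, hidden_dim=3, rng=rng)
states = tag_states([0, 1, 2, 3, 4], embeddings, cell)
print(states.shape)  # one 3-dim state per token: (5, 3)
```

Because a state is emitted at every position, the state for a token depends only on the tokens up to and including it. A type judgement can therefore be read off each position without first digesting the whole input, in contrast to the encoder-decoder translation setup.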


[Figure 2 diagram, bottom to top: input tokens "var x = 0 ; var y = x ;"; Embedding (300-dim); Bi-directional GRU (650-dim); consistency connection (+); Bi-directional GRU (650-dim); Project to Type Vectors; output types "- num - - - - num - num -".]
Figure 2: Architecture of the neural network with an example input and output, where connections between the layers (at every token) are omitted for clarity. ‘num’ is short for ‘number’ and ‘-’ indicates a dummy type for non-identifier tokens (which have no type). Note how, in DeepTyper, the two occurrences of x have an additional custom connection to improve consistency in type assignment.

To a first approximation, we can use an RNN for our sequence annotation task where we represent the “type judgement” context of the token $s_t$ with $\hat{\tau}_t = h_t$. Then, to predict the type vector, i.e. a probability distribution over every type $\tau$ in the type vocabulary, we use an output layer to project the hidden state onto a vector of dimension equal to the type vocabulary, followed by a softmax layer to normalize it to a valid categorical probability distribution over types. Each component of the type vector is then:

$$P_{s_t}(\tau) = \frac{\exp(\hat{\tau}_t^T r_\tau + b_\tau)}{\sum_{\tau'} \exp(\hat{\tau}_t^T r_{\tau'} + b_{\tau'})}, \qquad (1)$$

where $r_\tau$ is a representation learned for each type annotation $\tau$, $\hat{\tau}_t^T r_\tau$ is the inner product of the two vectors and $b_\tau$ a scalar bias for each annotation. However, this approach ignores all the relevant context to the right of $s_t$, i.e. information in $s_{t+1} \ldots s_N$.⁵ For this reason, we use an architecture called bidirectional RNNs (biRNN), which combines two RNNs running in opposite directions, one traversing the sequence forward and the other in reverse. The representation of the context for a single token $s_t$ becomes the concatenation of the states of the forward (left-to-right) and reverse (right-to-left) RNNs, i.e. we set $\hat{\tau}_t$ in Equation 1 to $\hat{\tau}_t = h^{bi}_t = [h^{\rightarrow}_t, h^{\leftarrow}_t]$: the concatenation of the hidden state of the forward RNN, $h^{\rightarrow}_t$, and that of the reverse RNN, $h^{\leftarrow}_t$, at position $t$.

⁵ Which is particularly important for this task; consider annotating x in var x = 0.

The network architecture we have described so far assumes that the annotations we produce for each token are independent of each other. This tends to be true in natural language but is not the case for source code: a variable may be used multiple times throughout the code, but its true type remains the same as at its declaration. If we were to ignore the interdependencies among multiple tokens, our annotations might turn out inconsistent between usages of the same variable. Although the RNN might learn to avoid such inconsistencies, in practice even long-memory RNNs such as GRUs have quite limited memory that makes it hard to capture such long-range dependencies.⁶ To address this problem, we propose a consistency layer as an extension to the standard biRNN, where the context representation for the token $s_t$ is

$$\hat{\tau}_t = h^{bi}_t + \frac{1}{|V(t)|} \sum_{i \in V(t)} h^{bi}_i \qquad (2)$$

where $V(t)$ is the set of all locations that are bound to the same identifier as the one in location $t$. Specifically, we average over the token representations after the first bidirectional layer and combine these with the input to the second bidirectional layer, as shown in Figure 2. By concatenating the output vector $h^{bi}_t$ with the average representation of all the bound tokens, we encourage the model to use long-range information from all usages of the identifier. Thus, the model learns to predict types based on both its sequentially local representation and the consensus judgement for all other locations where this identifier occurs. We could restrict the non-local part of Equation (2) to occurrences of the exact same variable only (e.g. by running a def-use analysis), but we found that it is very rare for two differently-typed, but same-named variables to occur in the same file. We chose instead to average over all identifiers with the same name, as this can provide more samples per identifier. Figure 2 shows the resulting network; we call this model DeepTyper.

⁶ Attention mechanisms [26] could be used to partially relieve this issue, but this extension is left to future work.

Design Decisions. Our neural network encapsulates a set of design decisions and choices which we explain here. Using the biRNN model allows us to capture a large (potentially unbounded) context around each token. Capturing a large context is crucial for predicting the type annotation of a variable, since it allows our model to learn long-range statistical dependencies (such as between a function’s type and its return statement). Additionally, including the identifiers (e.g. variable names) allows the model to incorporate probabilistic naming information into the type inference problem, a concept that has not been well explored in the literature. Also, it should be noted that viewing the input program as a sequence of tokens is a design decision that trades off the potential to use richer structural information (such as ASTs, dependency graphs) for the advantage of using well-understood models for sequence tagging whose training scales well with a large amount of data.

4 EVALUATION
Figure 3 gives an overview of our experimental setup. First, we collect data from online open-source projects (Section 4.2). The second step is initializing and training the deep learner (Section 3). Finally, we evaluate our approach against, and in combination with, a type inference engine, and we discuss how to use the trained algorithm for general code fragments, demonstrated through a web API. We conclude this section with an overview of the hardware used and corresponding memory use and timing information.

4.1 Objective
As outlined in Section 2, the goal of this work is to suggest useful type annotations from a fixed vocabulary for JavaScript (JS) and TypeScript (TS) code. Here, we define “useful type annotations” as those that developers have manually added in the TS code, and which we remove to produce our training data. In TS, there are
three categories of identifiers that allow optional⁷ type annotations: function return types, function parameters and variables. DeepTyper learns to suggest these by learning to assign a probability distribution over types, denoted a type vector, to each identifier occurrence in a file. To improve training, we do not only learn to assign types to definition sites, where the annotation would be added, but to all occurrences of an identifier. This helps the deep learner include more context in its type judgements and allows us to enforce its additional consistency constraint as described in Section 3.

⁷ Here, any is implicit if no annotation is added.

[Figure 3 diagram: three numbered phases (1, 2, 3) with components labeled Github API, TypeScript Compiler, Tokens, Types, Deep Learner, CheckJS and Web API.]
Figure 3: Overview of the experimental setup, consisting of three phases: data gathering, learning from aligned types, and evaluation.

The model is presented with the code as sequential data, with each token aligned with a type. Each token and type are encoded in their respective vocabularies (see Section 4.2) as a one-hot vector (with a one at the index of the correct token/type and zeros otherwise). The type may be a (deterministically assignable) no-type for tokens such as punctuation and keywords; we do not train the algorithms to assign these. Given a sequence of tokens, the model is tasked to predict the corresponding sequence of types.
At training time, the model’s accuracy is measured in terms of the cross-entropy between its produced type vector and the true, one-hot encoded type vector. At test time, the model is tasked with inferring the correct annotations at the locations where developers originally added type annotations that we removed to produce our aligned data. Although the model infers types for all occurrences of every identifier (because of the way it is trained), we report our results on the true original type annotations both because this is the most realistic test criterion and to avoid confusion.⁸

⁸ In brief, across all identifiers, DeepTyper reaches accuracies close to that of the compiler’s type inference and a hybrid of the two was able to yield superior results.

We evaluate the model primarily in terms of prediction accuracy: the likelihood that the most activated element of the type vector is the correct type. We focus on assigning non-any types (recall that any expresses uncertainty about a type), since those will be most useful to a developer. We furthermore distinguish between evaluating the accuracy at all identifier locations (including non-definition sites, as we do at training time) and inferring only at those positions where developers actually added type annotations in our dataset. For more details, see Section 4.4.

4.2 Data
Data Collection. We collected the 1,000 top starred open-source projects on GitHub that predominantly consisted of TypeScript code on February 28, 2018; this is a similar approach to Ray et al.’s study of programming languages [27]. Each project was parsed with the TypeScript compiler tsc, which infers type information (possibly any) for all occurrences of each identifier. We removed all files containing more than 5,000 tokens so that the sequences fit within a minibatch used in our deep learner. This removed only a small portion of both files (ca. 0.9%) and tokens (ca. 11%). We also removed all projects containing only TypeScript header files, which especially includes all projects from the ‘DefinitelyTyped’ ecosystem. After these steps, our dataset contains 776 TypeScript projects, with statistics listed in Table 1.

Table 1: Statistics of the dataset used in this study.
Category    Projects    Files     Tokens
Train       620         49,850    17,955,121
Held-out    78          7,854     3,918,175
Test        78          4,650     1,884,385

Our dataset was randomly split by project into 80% training data, 10% held-out (or validation) data and 10% test data. Among the largest projects included were Karma-Typescript (a test framework for TS), Angular and projects related to Microsoft’s VS Code. We focus only on inter-project type suggestion, because we believe this to be the most realistic use of our tool. That is, the model is trained on a pre-existing set of projects and then used to provide suggestions in a different/new project that was not seen during training. Future work may study an intra-project setting, in which the model can benefit from project-specific information, which will likely improve type suggestion accuracy.

Token and Type Vocabularies. As is common practice in natural language processing, we estimate our vocabularies on the training split and replace all the rare tokens (in our case, those seen fewer than 10 times) and all unseen tokens in the held-out and test data with a generic UNKNOWN token. Note that we still infer types for these tokens, even though their name provides no useful information to the deep learner. To reduce vocabulary size, we also replaced all numerals with ‘0’, all strings with "s" and all templates with a simple `template`, none of which affects the types of the code. The type vocabulary is similarly estimated on the types of the training data, except that rare types (again, those seen fewer than 10 times in the training data) and unseen types are simply treated as any. The number of tokens and types strongly correlates with the complexity of the model, so we set the vocabulary cut-off as low as possible while still making training feasible in reasonable time and memory. The resulting vocabularies consist of 40,195 source tokens and 11,830 types.

Aligning Data. To create an oracle and aligned corpus, we use the compiler to add type annotations to every identifier. We then remove all type annotations from the TS code, in order to create code that more closely resembles JS code. Note that this does not always produce actual JS code since TS includes a richer syntax
beyond just type annotations. We create two types of oracle datasets from this translation:
(1) ALL identifier data (training): we create an aligned corpus between tokens and types, in which every occurrence of every identifier has a type annotation from the compiler. This is the type of oracle data that we use for training. This data likely includes more types than a developer would want to annotate, since many could be inferred by the compiler.
(2) GOLD, annotation-only data (testing): we align only the types that developers annotated with the declaration site where the annotation was added. All other tokens are aligned with a no-type. This provides the closest approximation of the annotations that developers care about and serves as our test data.

4.3 Experiments and Models

DeepTyper. We study the accuracy and behavior of deep learning networks when applied to type inference across a range of metrics (see Section 4.4). Our proposed model enhances a conventional RNN structure with a consistency layer as described in Section 3 and is denoted DeepTyper. We compare this model against a plain RNN with the same architecture minus the consistency layer.

For our RNNs, we use 300-dimensional token embeddings, which are trained jointly with the model, and two 650-dimensional hidden layers, implemented as a bi-directional network with two GRUs each (Section 3). This allows information to flow forward and backward in the model and improves its accuracy. Finally, we apply drop-out regularization [31] with a drop-out probability of 50% to the second hidden layer and apply layer-normalization after the embedding layer. As is typical in NLP tasks like this, the token sequence is padded with start- and end-of-sentence tags (with no-type) as cues for the model.

We train the deep learner for 10 epochs with a minibatch size of up to five thousand tokens, requiring ca. 4,100 minibatches per epoch. We use a learning configuration that is typical for these tasks in NLP settings and fine-tuned our hyper-parameters using our validation data. We use an Adam optimizer [23]; we initialize the learning rate to $10^{-3}$ and reduce it every other epoch until it reaches $10^{-4}$, where it remains stable; we set momentum to $1/e$ after the first 1,000 minibatches and clip total gradients per sample to 15. Validation error is computed at every epoch and we select the model when this error stabilizes; for both of our RNN models, this occurred around epoch 5.

TSC + CheckJS. In the second experiment, we compare our deep learning models against those types that the TypeScript compiler (tsc) could infer (after removing type annotations), when also equipped with a static type-inference tool for JavaScript named CheckJS.9 CheckJS reads JavaScript and provides best-effort type inference, assigning any to those tokens to which it cannot assign a more precise type. Since TSC+CheckJS (hereafter typically abbreviated “CheckJS”) has access to complete compiler and build information of the test projects (while DeepTyper is evaluated in an inter-project setting), our main aim is not to outperform CheckJS but rather to demonstrate how probabilistic type inference can complement CheckJS by providing plausible, verifiable recommendations precisely where the compiler is uncertain.

9 see https://github.com/Microsoft/TypeScript/wiki/Type-Checking-JavaScript-Files

JSNice. In our final experiment, we compare the deep learner’s performance with that of JSNice [28]. JSNice was proposed as a method to (among others) learn type annotations for JavaScript from dependencies between variables, so we thought it instructive to compare and contrast performances. A perfect comparison is not possible as JSNice differs from our work in several fundamental ways: (1) it focuses on JavaScript code only whereas our model is trained on TypeScript code with a varying degree of similarity to plain JavaScript, (2) it assigns a limited set of types, including number, string, Object, Array, a few special cases of Object such as Element and Document, and ? (unsure), and (3) it requires compiler information (e.g. dependencies, scoping), whereas our approach requires just an aligned corpus and is otherwise language-agnostic.

4.4 Metrics

We evaluate our models on the accuracy and consistency of their predictions. Since a prediction is made at each identifier’s occurrence, we first evaluate each occurrence separately. We measure the rank of the correct prediction and extract top-K accuracy metrics. We evaluate the various models’ performances on real-world type annotations (the GOLD data). Unless otherwise stated, we only focus on suggesting the non-any types in our aligned datasets, since inferring any is generally not helpful. The RNN also emits a probability with its top prediction, which can be used to reflect its “confidence” at that location. This can be used to set a minimum confidence threshold, below which DeepTyper’s suggestions are not presented. Thus, we also show precision/recall results when varying this confidence threshold for DeepTyper.

Finally, we are interested in how consistent the model is in its assignment of types to identifiers across their definition and usages in the code. Let $X$ be the set of all type-able identifiers that occur more than once in some code of interest. For $DT : X \to \mathbb{N}$, let $DT(x)$ denote the number of types DeepTyper assigns to $x$, across all of its appearances. Ideally, $\forall x \in X,\ DT(x) = 1$; indeed, this is a constraint that standard type inference obeys. Like all stochastic approaches, DeepTyper is not so precise. Let $Y \triangleq \{x \in X \mid DT(x) > 1\}$. Then the type inconsistency measure of a type inference approach, like DeepTyper, that does not necessarily find the principal type of a variable across all of its uses, is $\frac{|Y|}{|X|}$.

4.5 Experimental Setup

The deep learning code was written in CNTK [4]. All experiments are conducted on an NVIDIA GeForce GTX 1080 Ti GPU with 11GB of graphics memory, in combination with a 6-core Intel i7-8700 CPU with 32GB of RAM. Our resulting model requires ca. 500MB of RAM to be loaded into memory and can be run on both a GPU and a CPU. It can provide type annotations for (reasonably sized) files in well under 2 seconds.

Our algorithm is programmatically exposed through a web API (Figure 4) that allows users to submit JavaScript or TypeScript files and annotates each identifier with its most likely types, subject to a


confidence threshold. All our code for training and evaluating DeepTyper is released on https://github.com/DeepTyper/DeepTyper.

Figure 4: A screen-shot of our web interface on the example from Figure 1.

5 RESULTS

We present our results in three phases, as per Section 4.3. We first study how well deep learning algorithms are suited for type inference in general, and study the notion of consistency specifically. Then, we compare DeepTyper’s performance with that of the TypeScript compiler plus CheckJS, showing furthermore how the models can be complementary. Finally, we present a comparison (and combination) on plain JS functions with JSNice [28], which tackles a similar, if narrower task.

5.1 Deep Learning for Type Inference

We first show the overall performance of the deep learning models on the test data, including both the plain RNN and our variant, DeepTyper, which is enhanced with a consistency layer. Table 2 shows the prediction accuracy (top 1 and 5) of the true types w.r.t. the models in the 78 test projects on the GOLD dataset (Section 4.2). We include a naïve model, which assigns each identifier the type distribution that it has at training time. This model achieves an acceptable accuracy without accounting for any identifier context, giving us a notion of what portion of the task is relatively simple. Xu et al. report a similar result for Python code [34], although we stress that this is not an implementation of their model (see Section 7). DeepTyper substantially outperforms it by including contextual information and achieves a top-1 accuracy of nearly 60% and top-5 accuracy of over 80% across the GOLD dataset.

Table 2: Accuracy results of various models, where DeepTyper includes the proposed consistency layer (Section 3) and naïve assigns each identifier the MLE distribution of types given that identifier from the training data.

Model        Top-k Accuracy
             GOLD@1    GOLD@5
Naïve        37.5%     78.9%
Plain RNN    55.0%     81.1%
DeepTyper    56.9%     81.1%

5.1.1 Consistency. In Table 2, DeepTyper yields higher prediction accuracy than the plain RNN. As we stated in Section 3, we qualitatively found that the plain RNN model yielded poor consistency between its assignments of types to multiple usages of the same identifier. We quantify this concern with the inconsistency metric described in Section 4.4. By this metric, the plain RNN assigns an inconsistent type 17.3% of the time. Our consistency layer has the effect of taking into account the average type assignment for each identifier in a function and achieves a modest, but significant consistency error reduction of around 2 percentage points, to 15.4%. Importantly, it does not accomplish this by sacrificing performance (as it might by gravitating to common types), but instead slightly boosts prediction accuracy as shown above. This shows promise for further investigation into adding global information to these models (Section 6). Thus, we use the DeepTyper model in our experiments going forward.

5.1.2 Performance Characteristics. A few common types account for most of the type annotations in the TypeScript data. We study the discrepancies between the predictability of the 10 most common types vs. the ca. 11,000 other types in Table 3. We also include prediction statistics of the any type for reference, which was by far the most common type in the training data,10 but was substantially less common among the real annotations shown here. Since all identifiers are implicitly typed as any unless another type is provided, recommending this type is not clearly useful. However, developers do evidently explicitly annotate some identifiers as any, so that accuracy on this task may still be useful for a type suggestion tool; this deserves further investigation.

10 This indicates that a great many identifiers could not be typed more specifically by the compiler, or were too rare to be included in the vocabulary.

Excluding any, the top 10 types account for most of the typed tokens. Among the most common types are the primitives string, number and boolean, as well as several object types: Array, Promise and HTMLElement. As can be seen, predicting the rarer types is substantially harder for DeepTyper, but it manages a usable top-5 accuracy nonetheless. This is especially true at locations where the model is most confident, as we discuss next.

Table 3: Accuracy on the 10 most common, and all other types, with ‘any’ included for reference

Type            Count    Top-K Accuracy
                         top-1     top-5
Top 10 total    9,946    71.1%     95.6%
Others total    5,158    29.6%     53.2%
any*            8,452    66.8%     97.2%
*included only for reference; suggesting any is typically not helpful to developers.

5.1.3 Recommendation. The deep learning algorithm emits a probability for each type assignment, which allows the use of a threshold to determine which suggestions are likely to be correct (Section 4.4). Figure 5 shows the trade-off in precision and recall when varying this threshold. Precision first exceeds 80% at a threshold of 90%, yielding a recall rate of ca. 50%. At a threshold of 99%, precision exceeds 95% at a still respectable recall rate of ca. 14.9%. At this level, DeepTyper could add more than 2,000 of the ca. 15,000 annotations we extracted across the 78 test projects with very high precision.


[Figure 5 plot: precision (vertical axis) against recall (horizontal axis), with points marked at confidence thresholds of 0%, 50%, 90%, 99% and 99.9%.]

Figure 5: Recall vs. Precision of DeepTyper as a recommender on the test data subject to probability thresholds (of the top suggestion) that reflect the model’s confidence.

5.2 Conventional Type Inference

We compare our results with those obtained by running the TypeScript compiler with CheckJS (see Section 4.3) on each project in the test corpus. Our main interest is a hybrid model: when the compiler has access to each test project’s complete build information, its type judgements are sound (although CheckJS may contribute a small number of unsound, heuristic predictions). The cases where it is unsure and defaults to any are locations where the deep learner may be able to contribute, since it ‘understands’ what types are natural in various locations. Thus, the hybrid model first assigns each variable its CheckJS type. When CheckJS assigns any to an identifier and DeepTyper is sufficiently confident in its suggestion, the hybrid switches to DeepTyper’s suggested type. Per Figure 5, we use confidence thresholds of 90%, 99% and 99.9% to achieve a balance between high precision (preferred in this setting) and recall.

We also report, as a percentage of all prediction points, how often DeepTyper correctly changes an any into a more specific type (Hits) and how often DeepTyper incorrectly suggests a type when CheckJS was either correct, or had soundly resorted to any (Misses). Crucially, these “Misses” are not sources of unsoundness! In a proper suggestion tool, any suggested type annotation can first be passed through the type checker and ruled out if it is unsound. Although this feedback loop was too costly to run for our automated evaluation, we manually investigated several incorrect annotations and found that ca. half of these could be ruled out by the compiler as unsound, whereas the remainder was sound, even if incorrect. This includes cases where DeepTyper’s type was too broad (HTMLElement where HTMLButtonElement was expected), or different from what the user expected but correct in the context, like cssText : number would be in Figure 1. Thus, we conclude that (1) any tool based on our model need not introduce any unsound annotations, and (2) the “Misses” column overstates how many incorrect annotations a user would actually encounter when also employing a type checker. Nonetheless, the balance between “Hits” and “Misses” gives an indication of the precision/effort trade-off at various thresholds.

The top-1 accuracy (CheckJS gives only one suggestion) for the three models is shown in Table 4, which for reference also shows a second set of results where DeepTyper is allowed to alter “all” type judgements (not just anys) when it is sufficiently confident. Although the models proved complementary on our training data, CheckJS could not go beyond DeepTyper here on the real developer annotations at test time. This strongly indicates that developers add annotations predominantly in those places where the type inference tool could not infer the correct type. It also stresses the relevance of our tool: in those cases where developers would need it most, it yields a top-1 accuracy of over 50% (and, referring back to Table 2, a top-5 accuracy of over 80%). Furthermore, the hybrid model proves useful at higher confidence rates by reducing DeepTyper’s incorrect types: at a 90% threshold, DeepTyper can contribute more than 4,000 types with over 95% precision to CheckJS’ own type inference!

Table 4: Accuracy of the three models (where DT is DeepTyper and CJ is the TypeScript compiler with CheckJS enabled) on both datasets. “Hits” reflects when DT overrules CJ and improves accuracy; “Misses” where it worsens accuracy. “Setting” specifies whether only CheckJS’ ‘any’ cases or all types can be overruled by DeepTyper, and the minimum confidence for DeepTyper to act. Results for CJ and DT by themselves are shown independent of threshold for clarity (and are thus identical in their columns).

Setting       Accuracy                    Hits     Misses
              CJ       DT       Hybrid
any, 90%      10.5%    56.9%    37.6%     27.1%    1.22%
any, 99%      idem.    idem.    20.6%     10.2%    0.07%
any, 99.9%    idem.    idem.    12.2%     1.80%    0.00%
all, 90%      idem.    idem.    38.5%     28.2%    1.41%
all, 99%      idem.    idem.    21.1%     10.7%    0.09%
all, 99.9%    idem.    idem.    12.3%     1.85%    0.00%

Allowing DeepTyper to correct all types vs. just any does not appear to be particularly more rewarding in terms of the Hits/Misses trade-off. In all cases, setting a higher threshold tends to improve the true positive rate of DeepTyper, which is in line with the precision/recall trade-off seen earlier. Since developers migrating their code are most likely to appreciate very precise suggestions first, DeepTyper has the potential to be a cost-effective aid.

5.3 Comparison With JSNice

JSNice was introduced as an approach for (among others) type inference on JavaScript using statistics from dependency graphs learned on a large corpus [28]. As discussed in Section 4.3, its approach is complementary to ours, so we thought it instructive to compare their performance with that of DeepTyper as well. Note that we are again using the original DeepTyper model here, not the “Hybrid” model from the previous section. Because JSNice is available to use via a web form, we manually entered JavaScript functions and recorded the results.

We selected JavaScript functions uniformly at random from public projects on GitHub that were in the top 100 JavaScript projects ranked by number of stars (similar to Ray et al. [27]). To avoid trivial functions, we selected functions that take at least one parameter and that return a value or have at least one declared variable in


their body. Thus each function requires two type annotations at the very least. Because the evaluation had to be performed manually, we examined thirty JavaScript functions.11 For each function we manually determined the correct types to use as an oracle for evaluation and comparison, assigning any if no conclusive type could be assigned. As a result, we identified 167 annotations (on function return types, local variables, parameters and attributes) of which 108 were clearly not any types.

11 The source of these functions and the functions themselves will be released after anonymity is lifted.

Again, we focus on predicting only the non-any types, since these are most likely to be helpful to the user. Cases in which JSNice predicted ? or Object, and cases where DeepTyper predicted any or was not sufficiently confident are all treated as “Unsure”. We again show results for various confidence thresholds for DeepTyper (across a slightly lower range than before, to better match the “Unsure” rate of JSNice) and include another hybrid model, in which DeepTyper may attempt to “correct” any cases in which JSNice was uncertain (or did not annotate a type at all) and DeepTyper is sufficiently confident. The results are shown in Table 5.

Table 5: Comparison of DeepTyper, JSNice, and Hybrid of both across thirty randomly selected JavaScript functions.

                   Correct   Partial   Incorrect   Unsure
JSNice [28]        51.9%     1.9%      0.9%        45.4%
DeepTyper ≥ 0%     55.6%     2.8%      6.5%        35.2%
DeepTyper ≥ 50%    51.9%     0.9%      2.8%        44.4%
DeepTyper ≥ 90%    35.2%     0.0%      0.0%        64.8%
Hybrid ≥ 0%        71.3%     3.7%      4.6%        20.4%
Hybrid ≥ 50%       70.4%     1.9%      1.9%        25.9%
Hybrid ≥ 90%       64.8%     1.9%      0.9%        32.4%

At the lowest threshold, DeepTyper gets both more types correct and more types wrong than JSNice, whereas at the highest threshold it makes no mistakes at all while still annotating more than one-third of the types correctly. JSNice made one mistake, in which it assigned a type that was too specific given the context.12 We also count “partial” correctness, in which the type given was too specific, but close to the correct type. This includes cases in which both JSNice and DeepTyper suggest HTMLElement instead of Element.

12 We found several more such cases among variables whose true type was deemed any and are thus not included in this table.

Overall, DeepTyper’s and JSNice’s performances are very similar on this task, despite DeepTyper having been trained primarily on TypeScript code, using a larger type vocabulary and not requiring any information about the snippet beyond its tokens. The two approaches are also remarkably complementary. JSNice is almost never incorrect when it does provide a type, but it is more often uncertain, not providing anything, whereas DeepTyper makes more predictions, but is incorrect more often than JSNice. A Hybrid approach in which JSNice is first queried and DeepTyper is used if JSNice cannot provide a type shows a dramatic improvement over each approach in isolation and demonstrates that JSNice and DeepTyper work well in differing contexts and for differing types. When using a 90% confidence threshold, the Hybrid model boosts the accuracy by 12.9 percentage points (51.9% to 64.8%) while introducing no additional incorrect or partially correct annotations. At the 0% threshold, the Hybrid model is more than 15 percentage points more likely to be correct than either model separately, while introducing fewer errors than DeepTyper would by itself.

Qualitatively, we find that DeepTyper particularly outperforms JSNice when the type is intuitively clear from the context, such as for cssText in Figure 1. It expresses high confidence (and corresponding accuracy) in tokens whose names include cues towards their type (e.g. “name” for string) and/or are used in idiomatic ways (e.g. concatenation with another string, or invoking element-related methods on HTMLElement-related types). JSNice often declares uncertainty on these because of a possibly ambiguous type (e.g. string concatenation does not imply that the right-hand argument is a string, and other classes may have declared similarly named methods). Vice versa, when JSNice does infer a type, it is very precise: whereas DeepTyper often gravitates to a subtype or supertype (especially any, if a variable is used in several far-apart places) of the true type, JSNice was highly accurate when it did not declare uncertainty and was able to include information (such as dataflow connections) from across the whole function, regardless of size. Altogether, our results demonstrate that these two methods excel at different locations, with JSNice benefiting from its access to global information and DeepTyper from its powerful learned intuition.

6 DISCUSSION

6.1 Learning Type Inference

Type inference is traditionally an exact task, and for good reason: unsound type inference risks breaking valid code, violating the central law of compiler design. However, sound type inference for some programming languages can be greatly encumbered by features of the language design (such as eval() in JS). Although the TypeScript compiler with CheckJS achieved good accuracy in our experiments in which it had access to the full project, it could still be improved substantially by probabilistic methods, particularly at the many places where it only inferred any. With partial typing now an option in languages such as TypeScript and Python, there is a need for type suggestion engines that can assist programmers in enriching their code with type information, preferably in a semi-automatic way.

A key insight of our work is that type inference can be learned from an aligned corpus of tokens and their types, and such an aligned corpus can be obtained fully automatically from existing data. This is similar to recent work by Vasilescu et al., who use a JavaScript obfuscator to create an aligned corpus of real-world code and its obfuscated counterpart, which can then be reversed to learn to de-obfuscate [33], although they did not approach this as a sequence tagging problem. This type of aligned corpus (e.g. text annotated with parse tags, named entities) is often a costly resource in natural language processing, requiring substantial manual effort, but comes all but free in many software-related tasks, primarily because they involve formal languages for which interpreters and compilers exist. As a result, vast amounts of training data can be made available for tasks such as these, to the great benefit of models such as the deep learners we used.
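The aligned corpus discussed in Section 6.1 can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the helper `align`, the token list, and the two dictionaries standing in for compiler output and developer annotations are hypothetical, and a real pipeline would obtain the types from the TypeScript compiler rather than from hand-written data.

```python
from typing import Dict, List, Set, Tuple

NO_TYPE = "no-type"  # label for tokens that carry no type (keywords, operators, ...)
ANY = "any"          # fallback for types outside the type vocabulary


def align(tokens: List[str], types_at: Dict[int, str], type_vocab: Set[str]) -> List[Tuple[str, str]]:
    """Pair every token with a type label, mapping out-of-vocabulary types to `any`."""
    aligned = []
    for i, tok in enumerate(tokens):
        if i in types_at:
            t = types_at[i]
            aligned.append((tok, t if t in type_vocab else ANY))
        else:
            aligned.append((tok, NO_TYPE))
    return aligned


# Hypothetical stand-in for `let n: number = 0; let m = n + 1;` with annotations stripped.
tokens = ["let", "n", "=", "0", ";", "let", "m", "=", "n", "+", "1", ";"]
compiler_types = {1: "number", 6: "number", 8: "number"}  # every identifier occurrence
developer_annotations = {1: "number"}                     # only what the developer wrote
type_vocab = {"number", "string", "boolean"}              # assumed toy type vocabulary

all_data = align(tokens, compiler_types, type_vocab)          # ALL-style alignment (training)
gold_data = align(tokens, developer_annotations, type_vocab)  # GOLD-style alignment (testing)

print(all_data[:3])
print(gold_data[:3])
```

Both alignments have one label per token, so either can be fed to a sequence tagger; they differ only in which positions carry a real type rather than no-type.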


6.2 Strengths and Weaknesses of the RNN

We have shown that RNN-based models can learn a remarkably powerful probabilistic notion of types through differentiable type vectors. This probabilistic perspective on types is a necessity for training these models and raises an interesting challenge: at once the models can deliver a highly accurate source of type guesses, while at the same time not being able to make any guarantees regarding the soundness of even its most confident annotations. For instance, if the RNN sees the phrase “var x = 0”, it may deem the (clearly correct) type ‘number’ for ‘x’ highly likely, but not a certainty (i.e. assign it a probability very close to 1). A hybrid approach provides a solution: when DeepTyper offers plausible and natural type annotation suggestions, the type checker can verify these, thus preserving soundness, similar to how a programmer might annotate code. It is also interesting to ask if we can teach deep learning algorithms some of these abilities. Provable correctness is not out of the scope of these models, as was demonstrated for neural program inference using recursion [11].

DeepTyper’s probabilistic nature also leads to an intriguing kind of “type drift”, also visible in our web tool, in which the probabilities in a variable’s type vector change throughout its definition and use in the code, even though its true type is fixed. We partially mitigated this limited awareness of the global accuracy of its type assignments by equipping the model with information that is lexically far away and saw gains in consistency and performance. Still, a substantial number of consistency errors remain, allowing room for improvement over the deep learners used in this work if global and local optimization can be balanced. Such a combination need not come from deep learning alone: the global model may be a symbiosis with a static type checker, or a method such as conditional random fields [28].

6.3 Extensions

The aligned corpus in our work is one between TypeScript code and the types for each identifier in this code. As such, our work only scratches the surface of what this free Rosetta Stone could give! Type inference is only one step in the compilation process and many other parts of TypeScript’s enhancements over JavaScript could be learned, including type definitions, classes, public/private modifiers, etc. Even fully transpiling TypeScript to JavaScript can be used to create an aligned corpus (although no longer token-aligned, and with a fair degree of boiler-plate code) that we may, in due time, be able to exploit to learn to convert entire files. This methodology is not bound to our current language either; an obvious extension is to partially typed Python code, but similar tasks in many languages (e.g. inferring nullity) may well be highly amenable to a comparable approach.

7 RELATED WORK

Type inference is a widely studied problem in programming language research. Inferring types for dynamic languages has become an important research area in light of the widespread use of languages such as JavaScript and Python, and recent moves to allow partial typing of these [6, 13, 34].

Probabilistic type inference, i.e. the use of probabilistic reasoning for inferring types, has received recent attention. JSNice [28] infers primitive types of JavaScript code by learning from a corpus. JSNice builds a dependency network among variables and learns statistical correlations that predict the type. In contrast to this work, our deep learner considers a much wider context than is defined by JSNice’s dependency network and aims to predict a larger set of type annotations. The work of Xu et al. [34] uses probabilistic inference for Python and defines a probabilistic method for fusing information from multiple sources such as attribute accesses and variable names. However, this work does not include a learning component but rather uses a set of hand-picked weights on probabilistic constraints. Both these works rely on factor graphs for type inference, while, in this work, we avoid the task of explicitly building such a graph by directly exploiting the interaction of a strong deep neural network and a pre-existing type checker.

Applying machine learning to source code is not a new idea. Hindle et al. [18] learned a simple n-gram language model of code to assist code completion. Raychev et al. [28] developed a probabilistic framework for predicting program properties, such as types or variable names. Other applications include deobfuscation [10], coding conventions [7, 28] and migration [21, 25]. Vasilescu et al. specifically apply machine learning to an aligned corpus within the same language, using an obfuscator to learn de-obfuscation of JavaScript [33]. Their work is closely related to ours, although our approach works both within TypeScript and can enhance JavaScript code into TypeScript code because the latter is a superset of the former. Furthermore, our work learns to translate information between domains: from tokens to their types, whereas de-obfuscation is only concerned with translation between identifiers.

8 CONCLUSION

Our work set out to study to what extent type annotations can be learned from the underlying code and whether such learners can assist programmers to lower the annotation tax. Our results are positive: we showed that deep learners can achieve a strong, probabilistic notion of types given code that extends across projects and to both TypeScript and plain JavaScript code. We also highlight their present flaws and hope to inspire research into further improvements. Even more promising is that DeepTyper proved to be complementary to a compiler’s type inference engine on an annotation task, even when the latter had access to complete build information. Jointly, they could predict thousands of annotations with high precision. Our tool is also complementary to JSNice [28] on plain JavaScript functions, which shows that our model is learning new, different type information from prior work. Our findings demonstrate potential for learning traditional software engineering tasks, type inference specifically, from aligned corpora.

ACKNOWLEDGEMENTS

Vincent Hellendoorn was partially supported by the National Science Foundation, award number 1414172.

REFERENCES

[1] [n. d.]. flow. https://flow.org/.
[2] [n. d.]. mypy. http://mypy-lang.org/.
[3] [n. d.]. Spectrum IEEE 2017 Top Programming Languages. http://spectrum.ieee.org/computing/software/the-2017-top-programming-languages.
[4] [n. d.]. The Microsoft Cognitive Toolkit. https://www.microsoft.com/en-us/cognitive-toolkit/.


[5] [n. d.]. TypeScript. https://www.typescriptlang.org/.
[6] Martin Abadi, Cormac Flanagan, and Stephen N Freund. 2006. Types for safe locking: Static race detection for Java. ACM Transactions on Programming Languages and Systems (TOPLAS) 28, 2 (2006), 207–255.
[7] Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2014. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 281–293.
[8] Esben Andreasen and Anders Møller. 2014. Determinacy in static analysis for jQuery. In ACM SIGPLAN Notices, Vol. 49. ACM, 17–31.
[9] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
[10] Benjamin Bichsel, Veselin Raychev, Petar Tsankov, and Martin Vechev. 2016. Statistical deobfuscation of Android applications. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 343–355.
[11] Jonathon Cai, Richard Shin, and Dawn Song. 2017. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611 (2017).
[12] Franck Dernoncourt, Ji Young Lee, and Peter Szolovits. 2017. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. arXiv preprint arXiv:1705.05487 (2017).
[13] Michael Furr, Jong-hoon David An, Jeffrey S Foster, and Michael Hicks. 2009. Static type inference for Ruby. In Proceedings of the 2009 ACM symposium on Applied Computing. ACM, 1859–1866.
[14] Zheng Gao, Christian Bird, and Earl T. Barr. 2017. To Type or not to Type: On the Effectiveness of Static typing for JavaScript. In Proceedings of the 39th International Conference on Software Engineering. IEEE.
[15] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[16] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning. ACM, 369–376.
[17] Stefan Hanenberg, Sebastian Kleinschmager, Romain Robbes, Éric Tanter, and Andreas Stefik. 2014. An empirical study on the impact of static typing on software maintainability. Empirical Software Engineering 19, 5 (01 Oct 2014), 1335–1382. https://doi.org/10.1007/s10664-013-9289-1
[18] Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 837–847.
[19] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[20] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
[21] Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. 2014. Phrase-based statistical translation of programming languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software. ACM, 173–184.
[22] Vineeth Kashyap, Kyle Dewey, Ethan A Kuefner, John Wagner, Kevin Gibbons, John Sarracino, Ben Wiedermann, and Ben Hardekopf. 2014. JSAI: A static analysis platform for JavaScript. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 121–132.
[23] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[24] Leo A Meyerovich and Ariel S Rabkin. 2012. Socio-PLT: Principles for programming language adoption. In Proceedings of the ACM international symposium on New ideas, new paradigms, and reflections on programming and software. ACM, 39–54.
[25] Anh Tuan Nguyen, Hoan Anh Nguyen, Tung Thanh Nguyen, and Tien N Nguyen. 2014. Statistical learning approach for mining api usage mappings for code migration. In Proceedings of the 29th ACM/IEEE international conference on Automated software engineering. ACM, 457–468.
[26] Chris Olah and Shan Carter. 2016. Attention and Augmented Recurrent Neural Networks. Distill (2016). https://doi.org/10.23915/distill.00001
[27] Baishakhi Ray, Daryl Posnett, Vladimir Filkov, and Premkumar Devanbu. 2014. A large scale study of programming languages and code quality in github. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 155–165.
[28] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting program properties from big code. In ACM SIGPLAN Notices, Vol. 50. ACM, 111–124.
[29] Koushik Sen, Swaroop Kalasapur, Tasneem Brutch, and Simon Gibbs. 2013. Jalangi: A selective record-replay and dynamic analysis framework for JavaScript. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 488–498.
[30] Jeremy G Siek, Michael M Vitousek, Matteo Cimini, and John Tang Boyland. 2015. Refined criteria for gradual typing. In LIPIcs-Leibniz International Proceedings in Informatics, Vol. 32. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[31] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[32] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
[33] Bogdan Vasilescu, Casey Casalnuovo, and Premkumar Devanbu. 2017. Recovering clear, natural identifiers from obfuscated JS names. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 683–693.
[34] Zhaogui Xu, Xiangyu Zhang, Lin Chen, Kexin Pei, and Baowen Xu. 2016. Python probabilistic type inference with natural language support. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 607–618.

