Selene: A PyTorch-Based Deep Learning Library For Sequence Data

Selene is a PyTorch-based deep learning library designed for the development and application of models on biological sequence data, enabling researchers to train and evaluate models easily. It supports various workflows, including model training, variant effect prediction, and in silico mutagenesis, making deep learning techniques more accessible to biomedical scientists. Selene aims to enhance the usability of deep learning in biology by providing a user-friendly interface and comprehensive support for different levels of computational experience.

Brief Communication

https://doi.org/10.1038/s41592-019-0360-8

Selene: a PyTorch-based deep learning library for sequence data

Kathleen M. Chen 1,5, Evan M. Cofer 2,3,5, Jian Zhou 1,2 and Olga G. Troyanskaya 1,2,4*

To enable the application of deep learning in biology, we present Selene (https://selene.flatironinstitute.org/), a PyTorch-based deep learning library for fast and easy development, training, and application of deep learning model architectures for any biological sequence data. We demonstrate on DNA sequences how Selene allows researchers to easily train a published architecture on new data, develop and evaluate a new architecture, and use a trained model to answer biological questions of interest.

Deep learning describes a set of machine learning techniques that use stacked neural networks to extract complicated patterns from high-dimensional data [1]. These techniques are widely used for image classification and natural language processing, and have led to very promising advances in the biomedical domain, including genomics and chemical synthesis [1–3]. In regulatory genomics, networks trained on high-throughput sequencing data (for example, ChIP-seq), or ‘sequence-based models’, have become the de facto standard for predicting the regulatory and disease impact of mutations [4–7]. While deep-learning-related publications are often accompanied by the associated pre-trained model [6,8,9], a key challenge in both developing new deep learning architectures and training existing architectures on new data is the lack of a comprehensive, generalizable, and user-friendly deep learning library for biology.

Beyond regulatory genomics, sequence-level deep learning models have broad promise in a wide range of research areas, including recent advances on prediction of disease risk of missense mutations in proteins [10] and potential applications to, for example, predicting target site accessibility in genome editing. We must enable the adoption and active development of deep-learning-based methods in biomedical sciences. For example, a biomedical scientist excited by a publication of a model capable of predicting the disease-associated effect of mutations should be able to train a similar model on their own ChIP-seq data focused on their disease of interest. A bioinformatician interested in developing new model architectures should be able to experiment with different architectures and evaluate all of them on the same data. Currently, this requires advanced knowledge specific to deep learning [2,11], substantial new code development, and associated time investment far beyond what most biomedical scientists are able to commit.

Here we present Selene, a framework for developing sequence-level deep learning networks that provides biomedical scientists with comprehensive support for model training, evaluation, and application across a broad range of biological questions. Sequence-level data refers to any type of biological sequence such as DNA, RNA, or protein sequences and their measured properties (for example, binding of transcription factors or RNA-binding proteins, or DNase sensitivity). Selene contains modules for (1) data sampling and training for model development (Fig. 1a) and (2) prediction and visualization for analyses using the trained model (Fig. 1b,c). With Selene, researchers can run model development and analysis workflows out-of-the-box. For more advanced use cases, Selene provides templates for extending modules within each workflow so that users can adapt the library to their particular research questions.

There has been recent work to make deep learning in biology more accessible: DragoNN is a toolkit for teaching deep learning in regulatory genomics; pysster [12] is a Python package for training convolutional neural networks on biological sequence data; and Kipoi [13] is a framework to archive, use, and build on published predictive models in genomics. These resources constitute the nascent software ecosystem for sequence-level deep learning. Selene is our contribution to this ecosystem. Selene supports general model development not constrained to a particular architecture (in contrast to pysster) or task (in contrast to DragoNN) and is designed for users with different levels of computational experience. Users are supported in tasks ranging from simply applying an existing model, to retraining it on new data (tasks also supported by Kipoi), to developing new model architectures (a task that is challenging to do with any other tool). The models developed using Selene can be shared and used through the Kipoi framework.

To demonstrate Selene’s capabilities for developing and evaluating sequence-level deep learning models, we use it to (1) train a published architecture on new data; (2) develop, train, and evaluate a new model (improving a published model); and (3) apply a trained model to data and visualize the resulting predictions in the case studies that follow.

In the first case study, a researcher wants to use the DeepSEA [4] model architecture as a starting point and train the model on different data. Selene is completely general and a user can easily use or specify any model of their choice using modules in PyTorch.

Suppose a cancer researcher is interested in modeling the regulatory elements of the transcription factor GATA1, specifically focusing on proerythroblasts in bone marrow. This is a tissue-specific genomic feature that DeepSEA does not predict. The researcher downloads peak data from Cistrome [14] and a reference genome FASTA file. Once a researcher formats the data to match the documented inputs [15] and fills out the necessary training parameters (for example, batch size or learning rate), they can use Selene to train the DeepSEA architecture on their data with no new lines of Python code. In this example, they find that the model obtains an area under the curve (AUC) of 0.942 on this feature (Fig. 2a).

Selene automatically generates training, testing, and validation samples from the provided input data. The samples generated for each partition can be saved and used in subsequent model development so that comparisons can be made across models with different

1 Flatiron Institute, Simons Foundation, New York, NY, USA. 2 Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA. 3 Graduate Program in Quantitative and Computational Biology, Princeton University, Princeton, NJ, USA. 4 Department of Computer Science, Princeton University, Princeton, NJ, USA. 5 These authors contributed equally: Kathleen M. Chen, Evan M. Cofer. *e-mail: [email protected]

Nature Methods | VOL 16 | APRIL 2019 | 315–318 | www.nature.com/naturemethods 315



[Figure 1. Panel a, Model training and evaluation: a model architecture (for example, DeepSEA), genome-wide signals from DNA/RNA/protein sequences (for example, TFs, DNase, histones) and a train.yml configuration file are passed to Selene (train and evaluate), which produces feature ROC curves (true positive rate versus false positive rate). Panel b, Variant effect prediction: a trained model, a variants VCF file and a variants_predict.yml configuration file are passed to Selene, which produces a plot of the maximum predicted effect score across features against genome coordinates (highlighted point: chr10 48031090, G/A; max diff score, 0.749; closest protein-coding gene, ASAH2C). Panel c, In silico mutagenesis: a trained model, a sequences FASTA file and a sequences_predict.yml configuration file are passed to Selene, which produces a heat map of bases (A, C, G, T) against position in sequence for the center 200 bp of a 1,000 bp sequence, chr8 (106051472, 106052472), with the original base at each position marked; color scale, -5.0 to 5.0.]
Fig. 1 | Overview of Selene. a, As input, the library accepts (left) the model architecture, dataset and (middle) a configuration file that specifies the
necessary input data paths and training parameters. Selene automatically splits the data into training and validation/testing, trains the model, evaluates
it, and (right) generates figures from the results. b, Selene supports variant effect prediction with the same configuration file format and includes
functionality to visualize the variants and their difference scores as a Manhattan plot, where a user can hover over each point to see variant information. c,
Selene calculates mutation effect scores and visualizes the scores as a heat map.
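Panel a takes a model architecture as one of its inputs. As a rough worked example of what such an architecture implies, the sketch below walks a 1,000 bp input (the window length shown in panel c) through the convolution and pooling layers listed in Methods and reports how many positions reach the fully connected layer. It is plain arithmetic rather than Selene or PyTorch code, and it assumes unpadded layers in which partial windows are dropped; the names `window_out`, `final_length`, `DEEPSEA` and `DEEPER` are illustrative only.

```python
# Illustrative arithmetic only. Assumption: no padding, and windows that do
# not fit completely inside the input are dropped.

def window_out(length, window, step):
    """Number of complete windows of size `window` taken every `step` positions."""
    return (length - window) // step + 1

# (layer type, window size, step size), copied from the Methods listings.
DEEPSEA = [("conv", 8, 1), ("pool", 4, 4),
           ("conv", 8, 1), ("pool", 4, 4),
           ("conv", 8, 1)]
DEEPER = [("conv", 8, 1), ("conv", 8, 1), ("pool", 4, 4),
          ("conv", 8, 1), ("conv", 8, 1), ("pool", 4, 4),
          ("conv", 8, 1), ("conv", 8, 1)]

FINAL_KERNELS = 960  # kernels in the last convolutional layer of both models


def final_length(layers, sequence_length=1000):
    """Sequence positions remaining after all convolution and pooling layers."""
    length = sequence_length
    for _kind, window, step in layers:
        length = window_out(length, window, step)
    return length


for name, layers in (("DeepSEA", DEEPSEA), ("deeper model", DEEPER)):
    positions = final_length(layers)
    print(f"{name}: {positions} positions, "
          f"{FINAL_KERNELS * positions} inputs to the fully connected layer")
# DeepSEA: 53 positions, 50880 inputs to the fully connected layer
# deeper model: 44 positions, 42240 inputs to the fully connected layer
```

Under these assumptions the deeper model hands fewer inputs to its fully connected layer than the original, because each extra unpadded convolution trims seven positions before pooling.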

architectures and/or parameters. Further, Selene automatically evaluates the model on the test set after training and, in this case, generates figures to visualize the model’s performance as receiver operating characteristic and average precision curves.

Now that the researcher has a trained model, they can use Selene to apply in silico mutagenesis (converting every position in the sequence to every other possible base [4] for DNA and RNA, or amino acid for protein sequences) to a set of GATA1 sequences drawn from the test set and examine the consequences of these ‘mutations’. Selene supports visualization of the outputs of in silico mutagenesis as a heat map and/or motif plot. By visualizing the log2 fold change for these sequences in a heat map, the researcher can see that the model detects disruptions in binding at the GATA motif (Fig. 2b).

We provide the code and results for this example in Selene’s GitHub repository (https://github.com/FunctionLab/selene; see case 1 in the ‘manuscript’ folder).

In another use case, a researcher may want to develop and train a new model architecture. For example, a bioinformatician might want to modify a published model architecture to see how that affects performance. First, the researcher uses modules in PyTorch to specify the model architecture they are interested in evaluating; in this case study, they try to enhance the DeepSEA architecture with batch normalization and three additional convolutional layers. The researcher specifies parameters for training and the paths to the model architecture and data in a configuration file and passes this as input to the library’s command-line interface (CLI). Training is automatically completed by Selene; afterward,

[Figure 2. Panel a, Selene-generated model performance: ROC curve (true positive rate versus false positive rate). Panel b, In silico mutagenesis on GATA1 sequences (Selene-generated heat maps): bases (A, C, G, T) against position in sequence (400 to 600) for the center 200 bp of two 1,000 bp sequences, chr8 (106051472, 106052472) and chr8 (89567149, 89568149), with the original base at each position marked; color scale, -5.0 to 5.0.]

Fig. 2 | Visualizations generated by using Selene to train and apply a model to sequences. a, Selene visualization of the performance of the model trained in the first case study. b, Selene visualization of in silico mutagenesis on the case-study-trained model for 20 randomly selected GATA1 sequences in the test set (two representative plots displayed here; all heat maps generated are displayed in the example Jupyter notebook, https://github.com/FunctionLab/selene/blob/master/manuscript/case1/3_visualize_ism_outputs.ipynb). Bases in the original sequence are distinguished by the gray stripes in the heat map cells.
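The heat maps in panel b come from in silico mutagenesis, which the main text describes as converting every position to every other possible base and viewing the result as a log2 fold change. A minimal sketch of that procedure follows. It is not Selene's implementation: `toy_predict` is a hypothetical stand-in for a trained model, and taking the score as log2(mutated prediction / reference prediction) is an assumption, since the text does not spell out the formula.

```python
import math

BASES = "ACGT"


def toy_predict(sequence):
    """Hypothetical stand-in for a trained model: returns a binding
    probability that is high only when the GATA motif is present."""
    return 0.9 if "GATA" in sequence else 0.1


def in_silico_mutagenesis(sequence, predict):
    """Score every single-base substitution.

    Returns a dict mapping (position, new_base) to
    log2(prediction for mutated sequence / prediction for reference).
    The log2-ratio definition is an assumption made for this sketch.
    """
    reference = predict(sequence)
    scores = {}
    for position, original in enumerate(sequence):
        for base in BASES:
            if base == original:
                continue  # the original base is not a mutation
            mutated = sequence[:position] + base + sequence[position + 1:]
            scores[(position, base)] = math.log2(predict(mutated) / reference)
    return scores


scores = in_silico_mutagenesis("CCGATACC", toy_predict)

# Substitutions inside the motif (positions 2 to 5) lower the prediction,
# giving log2(0.1 / 0.9), about -3.17; substitutions elsewhere score 0.
for (position, base), score in sorted(scores.items()):
    print(position, base, round(score, 2))
```

Arranging these scores as a 4 x L grid (bases by positions) gives the kind of matrix that a heat map such as the one in panel b displays, with strongly negative cells concentrated at the motif.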

[Figure 3. Panel a, Selene-generated model performance: feature ROC curves (true positive rate versus false positive rate). Panel b, Mean difference between SNP groups for feature K562|H3K36me3|None (q value = 3.89 × 10^-67): Gaussian-transformed predicted effect scores (axis range, -0.24 to -0.20) for GWAS nominally significant versus GWAS nonsignificant variants.]

Fig. 3 | Using Selene to train a model and obtain model predictions for variants in an Alzheimer’s disease GWAS. a, Selene visualization of the performance of the trained six-convolutional-layer model. b, We visualize the mean and 95% confidence intervals of the quantile-normalized (against the Gaussian distribution) predicted effect scores of the two variant groups for the genomic feature H3K36me3 in K562 cells, the feature in the model with the most significant difference (one-sided Wilcoxon rank-sum test, adjusted P value using Benjamini–Hochberg of 3.89 × 10^-67). After applying the multiple testing correction, 914 of the 919 genomic features that the model predicts showed a significant difference (α < 0.05) between the groups. SNP, single-nucleotide polymorphism.
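The caption reports P values adjusted with the Benjamini–Hochberg procedure across the 919 features. A minimal sketch of that adjustment step is given below; it covers only the multiple testing correction, not the Wilcoxon rank-sum test itself, and the four raw P values are made up for illustration rather than taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(p_values)
    # Indices of the P values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values never
    # increase as the raw P value decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


raw = [0.001, 0.04, 0.03, 0.2]  # made-up per-feature P values
adjusted = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adjusted]

print([round(q, 4) for q in adjusted])  # [0.004, 0.0533, 0.0533, 0.2]
print(significant)                      # [True, False, False, False]
```

In this toy input, two features with raw P values below 0.05 are no longer significant after adjustment, which is the effect the correction is meant to have when many features are tested at once.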

the researcher can easily use Selene to compare the performance of their new model to the original DeepSEA model on the same chromosomal holdout dataset.

In this case study, the researcher finds that the deeper architecture achieves an average AUC of 0.938 (Fig. 3a) and an average area under the precision recall curve (AUPRC) of 0.362, which is an improvement over the average AUC of 0.933 and AUPRC of 0.342 of the original three-convolutional-layer model. The researcher can share this model with a collaborator and upload it to the Kipoi [13] model zoo, a repository of trained models for regulatory genomics, with which Selene-trained models are fully compatible (an example is available in the GitHub repository at https://github.com/FunctionLab/selene/tree/master/manuscript/case2/3_kipoi_export).

In the final case study, a human geneticist studying Alzheimer’s wants to apply the six-convolutional-layer model developed in the previous case study, so they first assess its ability to prioritize potential disease-associated variants. Specifically, they use Selene to make variant effect predictions for nominally significant variants (P < 0.05, n = 422,398) and non-significant variants (P > 0.50, n = 3,842,725) reported in the International Genomics of Alzheimer’s Project [16] Alzheimer’s disease GWAS [17]. The researcher finds that the predicted effect is significantly higher for GWAS nominally significant variants than for non-significant variants, indicating that the new model is indeed able to prioritize potential disease-associated variants (one-sided Wilcoxon rank-sum test; the most significant feature, H3K36me3 in K562 cells, has an adjusted P value, by Benjamini–Hochberg correction, of 3.89 × 10^-67) (Fig. 3b).

Selene’s modeling capability extends far beyond the case studies described here. The library can be applied to not only DNA but also RNA and protein sequences, and not only chromatin data but any current genome-, transcriptome-, or even proteome-wide measurements. We developed Selene to increase the accessibility of deep learning in biology and facilitate the creation of reproducible workflows and results. Furthermore, Selene is open-source software that will continue to be updated and expanded on the basis of community and user feedback.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/s41592-019-0360-8.


Received: 8 October 2018; Accepted: 20 February 2019; Published online: 28 March 2019

References
1. LeCun, Y., Bengio, Y. & Hinton, G. Nature 521, 436–444 (2015).
2. Ching, T. et al. J. R. Soc. Interface. 15, 20170387 (2018).
3. Segler, M. H. S., Preuss, M. & Waller, M. P. Nature 555, 604–610 (2018).
4. Zhou, J. & Troyanskaya, O. G. Nat. Meth. 12, 931–934 (2015).
5. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Nat. Biotechnol. 33, 831–838 (2015).
6. Kelley, D. R., Snoek, J. & Rinn, J. L. Genome Res. 26, 990–999 (2016).
7. Angermueller, C., Lee, H. J., Reik, W. & Stegle, O. Genome. Biol. 18, 67 (2017).
8. Kelley, D. R. et al. Genome Res. 28, 739–750 (2018).
9. Quang, D. & Xie, X. Nucleic Acids Res. 44, e107 (2016).
10. Sundaram, L. et al. Nat. Genet. 50, 1161–1170 (2018).
11. Min, S., Lee, B. & Yoon, S. Brief. Bioinform. 18, 851–869 (2017).
12. Budach, S. & Marsico, A. Bioinformatics 34, 3035–3037 (2018).
13. Avsec, Z. et al. bioRxiv Preprint at https://www.biorxiv.org/content/10.1101/375345v1 (2018).
14. Mei, S. et al. Nucleic Acids Res. 45, D658–D662 (2017).
15. Troyanskaya, O. G. et al. Selene CLI operations and outputs. Selene https://selene.flatironinstitute.org/overview/cli.html (2018).
16. Ruiz, A. et al. Transl. Psychiatry 4, e358 (2014).
17. Huang, K.-L. et al. Nat. Neurosci. 20, 1052–1061 (2017).

Acknowledgements
The authors acknowledge all members of the Troyanskaya lab for helpful discussions. In addition, the authors thank D. Simon for setting up the website and automating updates to the site. The authors are pleased to acknowledge that this work was performed using the high-performance computing resources at Simons Foundation and the TIGRESS computer center at Princeton University. This work was supported by NIH grants R01HG005998, U54HL117798, R01GM071966, and T32HG003284; HHS grant HHSN272201000054C; and Simons Foundation grant 395506, all to O.G.T. O.G.T. is a CIFAR fellow.

Author contributions
K.M.C. and J.Z. conceived the Selene library. K.M.C. and E.M.C. designed, implemented, and documented Selene. K.M.C. performed the analyses described in the manuscript. O.G.T. supervised the project. K.M.C., E.M.C., and O.G.T. wrote the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41592-019-0360-8.
Reprints and permissions information is available at www.nature.com/reprints.
Correspondence and requests for materials should be addressed to O.G.T.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© The Author(s), under exclusive licence to Springer Nature America, Inc. 2019

Methods

Overview of Selene. Selene consists of two components: a Python library for developing sequence-level neural networks, and a command-line interface (CLI) for prototypical use cases of the library (that is, training a new model, evaluating an existing model, and analyzing sequence data and variants with a trained model). We herein refer to these components as the software development kit (SDK) and the CLI, respectively. All functionality provided by the CLI is also available to the user through the SDK. Rather than supplanting the SDK, the CLI is intended to maximize code reuse and minimize user time spent learning the SDK by heavily reducing the configuration tasks left to the user (for example, when GPU usage is specified, the CLI ensures all appropriate computations are performed on the GPU). When appropriate, the SDK does deliver functionality beyond that of the CLI. For instance, the SDK includes several data visualization methods that would be too unwieldy as executables run from the command line.

Thorough documentation for the SDK is available at https://selene.flatironinstitute.org, and tutorials for both the CLI and SDK can be found on the GitHub page (https://github.com/FunctionLab/selene). Notably, one tutorial demonstrates how to use Selene to train a deep neural network regression model (https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_mpra_example.ipynb). This tutorial illustrates Selene’s use outside of the models of transcriptional regulation shown in the case studies.

Selene software development kit. The Selene SDK, formally known as selene_sdk, is an extensible Python package intended to ease the development of new programs that leverage sequence-level models through code reuse. The Selene CLI is built entirely on the functionality provided by the SDK, but it is probable that users will use the SDK outside this context. For example, after training a new sequence-level model with the CLI, one could use the SDK in conjunction with a Python-based web application framework (e.g., Flask, Django) to build a web server so that other researchers can submit sequences or variants and get the trained model’s predictions as output.

Leveraging the SDK in a user’s Python project is no different from using any other Python module. That is, one only needs to import the selene_sdk module or any of its members and supply them with the correct parameters. The runtime behavior of each component of selene_sdk, as well as the required parameters for all members of selene_sdk, is described in detail in the online documentation (https://selene.flatironinstitute.org/overview/overview.html).

Selene CLI. The Selene CLI is a usable program to be run from the command line by the user. It encapsulates the configuration, execution, and logging of Selene’s most common use cases. These use cases are embodied by the CLI’s three commands: train, evaluate and analyze. These commands are used to train new models, evaluate the performance of trained models, and analyze model predictions (perform in silico mutagenesis or variant effect prediction), respectively. Each command configures its specific runtime environment with a combination of command line arguments and parameters drawn from user-provided configuration files. The flexibility of these configuration files allows them to leverage user-developed code as well, and further extends the usability of the CLI. We provide a step-by-step tutorial that describes the CLI configuration file format and shows some example configuration keys and values (https://github.com/FunctionLab/selene/blob/master/tutorials/getting_started_with_selene/getting_started_with_selene.ipynb); examples of CLI configuration code files are available at https://github.com/FunctionLab/selene/tree/master/config_examples. Finally, comprehensive documentation detailing all possible configurations supported by Selene can be found on Selene’s documentation website (https://selene.flatironinstitute.org/overview/cli.html). Users can reference any of these resources when creating their own configuration files.

Model architectures. DeepSEA architecture used in case 1 (from the supplementary note in the DeepSEA publication [4]):
(1) Convolutional layer (320 kernels; window size, 8; step size, 1)
(2) Pooling layer (window size, 4; step size, 4)
(3) Convolutional layer (480 kernels; window size, 8; step size, 1)
(4) Pooling layer (window size, 4; step size, 4)
(5) Convolutional layer (960 kernels; window size, 8; step size, 1)
(6) Fully connected layer (919 genomic features)
(7) Sigmoid output layer

Dropout proportion (proportion of outputs randomly set to 0):
• Layer 2: 20%
• Layer 4: 20%
• Layer 5: 50%
• All other layers: 0%

Architecture used in cases 2 and 3:
(1) Convolutional layer (320 kernels; window size, 8; step size, 1)
(2) Convolutional layer (320 kernels; window size, 8; step size, 1)
(3) Pooling layer (window size, 4; step size, 4)
(4) Convolutional layer (480 kernels; window size, 8; step size, 1)
(5) Convolutional layer (480 kernels; window size, 8; step size, 1)
(6) Pooling layer (window size, 4; step size, 4)
(7) Convolutional layer (960 kernels; window size, 8; step size, 1)
(8) Convolutional layer (960 kernels; window size, 8; step size, 1)
(9) Fully connected layer (919 genomic features)
(10) Sigmoid output layer

Dropout proportion:
• Layer 5: 20%
• Layer 8: 50%

Batch normalization applied after layers 2, 5, and 8 and before dropout.

Both architectures use the binary cross-entropy loss function and stochastic gradient descent optimizer (momentum, 0.9; weight decay, 10^-6).

Reproducing the case studies. Below, we have described the steps taken for each of the case studies. The code required to reproduce each case study is included in the GitHub repository (https://github.com/FunctionLab/selene/tree/master/manuscript) and was run with Selene version 0.2.0. We have also created Zenodo records for each case that contain all the input data, data processing scripts and output files generated from Selene:
• Case 1: https://doi.org/10.5281/zenodo.1442433
• Case 2: https://doi.org/10.5281/zenodo.1442437
• Case 3: https://doi.org/10.5281/zenodo.1445555

Case 1: training a state-of-the-art architecture on a different dataset. Steps to train DeepSEA on new data.
(1) Download the data from Cistrome. In this case, we are only working with one dataset for one specific genomic feature. Cistrome ID 33545, measurements from GSM970258.
(2) Format the data. We use tools from Samtools [18] (specifically, tabix [19] and bgzip from HTSlib, https://www.htslib.org/). Create a .bed file of chromosome, start, end and the genomic feature name (useful when there is more than one feature). Sort this file and compress it into a .gz file. Tabix index this file. Specific commands:
    (i) Only use the columns [chr, start, end]: cut -f 1-3 <peaks-file> > <peak-coordinates-file>. Note: Eventually, we will add support for parsing BED files with strand specific features and/or continuous values that quantify these features.
    (ii) Add the genomic feature name as the fourth column of the file: sed -i 's/$/\t<feature-name>/' <peak-coordinates-file>
    (iii) Sort the file by [chr, start, end]: sort -k1V -k2n -k3n <peak-coordinates-file> > <sorted-coordinates-file>
    (iv) Compress the file: bgzip <sorted-coordinates-file>. This compresses the file to a .gz file in place. To separately generate the .gz file, run bgzip -c <sorted-coordinates-file> > <sorted-coordinates-file>.gz
    (v) Tabix index the file: tabix -p bed <sorted-coordinates-file>.gz
(3) Create a file of distinct features that the model will predict, where each feature is a single line in the file. This can easily be created from the .bed file in step 2 by running cut -f 4 <peak-coordinates-file> | sort -u > <distinct-features>.
(4) Download the GRCh38/hg38 FASTA file. We downloaded the reference sequences GRCh37/hg19 and GRCh38/hg38 used in our analyses from ENCODE [20].
(5) Specify the model architecture, loss and optimizer as a Python file. An example of this is available for DeepSEA at https://github.com/FunctionLab/selene/blob/master/models/deepsea.py.
(6) Fill out the configuration file with the appropriate file paths and training parameters. We recommend starting from one of the example training files in the GitHub tutorials (for example, https://github.com/FunctionLab/selene/blob/master/tutorials/getting_started_with_selene/getting_started_with_selene.ipynb) or in the ‘config_examples’ directory (https://github.com/FunctionLab/selene/tree/master/config_examples). You can also review the documentation for the configuration parameters on Selene’s website [15].
(7) Run Selene.

Steps to apply and visualize the results of in silico mutagenesis.
(1) Collect sequences you want to visualize as a FASTA file. For this particular case, we provide a script to do so (https://github.com/FunctionLab/selene/blob/master/manuscript/case1/data/get_test_regions.py).
(2) Fill out the configuration file with the appropriate file paths (for example, the path to the FASTA file and the trained model weights file).
(3) Run Selene. You will get the raw predictions and the log2 fold change scores as output files.

(4) Follow one of the Jupyter notebook tutorials for in silico mutagenesis (https://github.com/FunctionLab/selene/tree/master/tutorials) to generate visualizations for the sequences. We have done this at https://github.com/FunctionLab/selene/blob/master/manuscript/case1/3_visualize_ism_outputs.ipynb.

Case 2: developing a new architecture and making model comparisons. Steps to train ‘deeper DeepSEA’ on the same exact data as DeepSEA.
(1) Download the code and data bundle from the DeepSEA website (http://deepsea.princeton.edu/media/code/deepsea_train_bundle.v0.9.tar.gz). You only need the .mat files in this directory. We also include a file listing the 919 genomic features that the model predicts. This is from the resources directory in the standalone version of DeepSEA (http://deepsea.princeton.edu/media/code/deepsea.v0.94b.tar.gz). Zenodo record: https://zenodo.org/record/2214970/files/DeepSEA_data.tar.gz.
(2) Fill out the configuration file for Selene’s MultiFileSampler (https://selene.flatironinstitute.org/overview/cli.html#multiple-file-sampler) and specify the path to each .mat file for training, validation and testing.
(3) Run Selene.
Please see the DeepSEA publication (ref. 4) for details about data processing and training.
In the main text, we report test performance for the model trained using the online sampler. When training on the same exact data (the .mat files) as DeepSEA, we achieve an average AUC of 0.934 and an average AUPRC of 0.361.

Steps to download and format all the peak data from ENCODE and Roadmap Epigenomics.
(1) Download all chromatin feature profiles used for training DeepSEA, specified in Supplementary Table 1 of the DeepSEA manuscript (ref. 4) (https://zenodo.org/record/2214970/files/chromatin_profiles.tar.gz).
(2) For each file, keep the chromosome, start, and end columns. In addition, create a fourth column with the feature’s name. Concatenate all these files and create the distinct features file. We provide a Python script for this step on the GitHub page (https://github.com/FunctionLab/selene/blob/master/manuscript/case2/1_train_with_online_sampler/data/process_chromatin_profiles.py).
(3) Format the data according to the instructions in the ‘Getting started’ tutorial:
    i. Sort the file by [chr, start, end]: sort -k1V -k2n -k3n <peak-coordinates-file> > <sorted-coordinates-file>.
    ii. Compress the file: bgzip <sorted-coordinates-file>. This compresses the file to a .gz file in place. To separately generate the .gz file, run bgzip -c <sorted-coordinates-file> > <sorted-coordinates-file>.gz.
    iii. Tabix index the file: tabix -p bed <sorted-coordinates-file>.gz.
    A script containing these steps can be downloaded from https://github.com/FunctionLab/selene/blob/master/manuscript/case2/1_train_with_online_sampler/data/process_data.sh.
(4) Download the hg19 FASTA file (https://www.encodeproject.org/files/male.hg19/@@download/male.hg19.fasta.gz).
(5) Specify the model architecture, loss, and optimizer as a Python file: https://github.com/FunctionLab/selene/blob/master/selene_sdk/utils/example_model.py.
(6) Fill out the configuration file with the appropriate file paths and training parameters. We set the training parameters (number of steps, batches and so on) so that they matched how DeepSEA was originally trained.
(7) Run Selene.

Case 3: applying a new model to variants.
(1) Download the single-nucleotide polymorphisms from the International Genomics of Alzheimer’s Project (https://www.niagads.org/igap-age-onset-survival-analyses-p-value-only).
(2) Group the variants into those with P values below 0.05 (significant) and those with P values above 0.50 (non-significant).
(3) Fill out the configuration file with the paths to the two variant files and the trained model weights file from case 2.
(4) Run Selene.
(5) Follow the script provided for this case to analyze the variant predictions (https://github.com/FunctionLab/selene/blob/master/manuscript/case3/2_variant_groups_comparison.sh).

Statistical analysis. Details of the statistical test used for case study 3 are specified in the associated text and figure legend (Fig. 3b).

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability
Selene is open-source software (license BSD 3-Clause Clear). Project homepage: https://selene.flatironinstitute.org. GitHub: https://github.com/FunctionLab/selene. Archived version: https://github.com/FunctionLab/selene/archive/0.2.0.tar.gz.

Data availability
Cistrome (ref. 14), Cistrome file ID 33545, measurements from GSM970258: http://dc2.cistrome.org/api/downloads/eyJpZCI6IjMzNTQ1In0%3A1fujCu%3ArNvWLCNoET6o9SdkL8fEv13uRu4b/. ENCODE (ref. 21) and Roadmap Epigenomics (ref. 22) chromatin profiles: files listed in Supplementary Table 1 of ref. 4. IGAP age at onset survival (refs. 16,17): https://www.niagads.org/datasets/ng00058 (P-values-only file). The case studies used processed datasets from these sources. They can be downloaded at the following Zenodo links: Cistrome, https://zenodo.org/record/2214130/files/data.tar.gz; ENCODE and Roadmap Epigenomics chromatin profiles, https://zenodo.org/record/2214970/files/chromatin_profiles.tar.gz; IGAP age at onset survival, https://zenodo.org/record/1445556/files/variant_effect_prediction_data.tar.gz. Source data for Figs. 2 and 3 are available online.

References
18. Li, H. et al. Bioinformatics 25, 2078–2079 (2009).
19. Li, H. Bioinformatics 27, 718–719 (2011).
20. ENCODE Project. Reference sequences. ENCODE: Encyclopedia of DNA Elements https://www.encodeproject.org/data-standards/reference-sequences/ (2016).
21. ENCODE Project Consortium. Nature 489, 57–74 (2012).
22. Kundaje, A. et al. Nature 518, 317–330 (2015).
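Steps (2) and (3) of the ENCODE and Roadmap Epigenomics formatting procedure in the Methods can be sketched in Python as follows. This is a minimal illustration, not the repository's process_chromatin_profiles.py or process_data.sh. The function names, the feature labels passed in by the caller, and the one-feature-per-file assumption are ours. The sort flags assume GNU sort (for -k1V), and bgzip and tabix are assumed to be on the PATH.

```python
import subprocess


def combine_peak_files(peak_files, combined_path, features_path):
    """Keep chrom, start, end from each peak file and add the feature name.

    `peak_files` maps a feature name (hypothetical labels chosen by the
    caller) to the path of a tab-separated peak file whose first three
    columns are chromosome, start and end.
    """
    with open(combined_path, "w") as out:
        for feature, path in peak_files.items():
            with open(path) as handle:
                for line in handle:
                    fields = line.rstrip("\n").split("\t")
                    if len(fields) < 3:
                        continue  # skip blank or malformed lines
                    out.write("\t".join(fields[:3] + [feature]) + "\n")
    # The distinct features file lists each feature name once, one per line.
    distinct = sorted(peak_files)
    with open(features_path, "w") as out:
        out.write("\n".join(distinct) + "\n")
    return distinct


def build_index_commands(combined_path, sorted_path):
    """Return (argv, stdout_path) pairs for the sort, bgzip and tabix steps."""
    gz_path = sorted_path + ".gz"
    return [
        (["sort", "-k1V", "-k2n", "-k3n", combined_path], sorted_path),
        (["bgzip", "-c", sorted_path], gz_path),
        (["tabix", "-p", "bed", gz_path], None),
    ]


def run_commands(commands):
    """Run each command, redirecting stdout to a file when a path is given."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
```

Calling run_commands(build_index_commands("peaks.bed", "sorted_peaks.bed")) would execute the three commands in order. The sketch uses bgzip -c so that the uncompressed sorted file is kept alongside the .gz file.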

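The variant grouping in step (2) of case 3 (P values below 0.05 versus above 0.50) can be illustrated with a short Python sketch. The input layout, a sequence of (variant identifier, P value) pairs, and the function name are assumptions made for illustration; the actual IGAP file columns are not described here. We also assume that variants with P values between the two thresholds are simply left out of both groups.

```python
def group_variants_by_p_value(variants, significant_below=0.05,
                              nonsignificant_above=0.50):
    """Split (variant_id, p_value) pairs into two groups by P value.

    Variants with a P value between the two thresholds (inclusive)
    are not placed in either group.
    """
    significant, nonsignificant = [], []
    for variant_id, p_value in variants:
        if p_value < significant_below:
            significant.append(variant_id)
        elif p_value > nonsignificant_above:
            nonsignificant.append(variant_id)
    return significant, nonsignificant


# Hypothetical identifiers and P values, for illustration only.
example = [("var_a", 0.001), ("var_b", 0.30), ("var_c", 0.75), ("var_d", 0.05)]
significant, nonsignificant = group_variants_by_p_value(example)
```

Each group would then be written to its own variant file, matching the two variant files referenced in step (3) of case 3.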


nature research | reporting summary
Corresponding author(s): Olga Troyanskaya

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main
text, or Methods section).
n/a Confirmed
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
- Clearly defined error bars. State explicitly what error bars represent (e.g. SD, SE, CI)

Our web collection on statistics for biologists may be useful.

Software and code


Policy information about availability of computer code
Data collection: No software was used to collect data in this study (that is, no data was collected).

Data analysis:
Project homepage: https://selene.flatironinstitute.org
GitHub: https://github.com/FunctionLab/selene
Archived version: https://github.com/FunctionLab/selene/archive/0.2.0.tar.gz
Additional software used: Samtools (version 1.9). Specifically, tabix and bgzip in the HTSlib (version 1.9) package.

For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

April 2018

Data



Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
Data sources

Cistrome
Cistrome file ID: 33545, measurements from GSM970258 (Xu et al., 2012)
http://dc2.cistrome.org/api/downloads/eyJpZCI6IjMzNTQ1In0%3A1fujCu%3ArNvWLCNoET6o9SdkL8fEv13uRu4b/

ENCODE and Roadmap Epigenomics chromatin profiles
Files listed in https://media.nature.com/original/nature-assets/nmeth/journal/v12/n10/extref/nmeth.3547-S2.xlsx

IGAP age at onset survival
https://www.niagads.org/datasets/ng00058 (P-values-only file)

Processed datasets from these sources are available at the following Zenodo links:
Cistrome:
https://zenodo.org/record/2214130/files/data.tar.gz
ENCODE and Roadmap Epigenomics chromatin profiles:
https://zenodo.org/record/2214970/files/chromatin_profiles.tar.gz
IGAP age at onset survival:
https://zenodo.org/record/1445556/files/variant_effect_prediction_data.tar.gz

Field-specific reporting
Please select the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences | Behavioural & social sciences | Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/authors/policies/ReportingSummary-flat.pdf

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size: Only previously published data were used.
Data exclusions: Only previously published data were used.
Replication: This study does not present experimental findings.
Randomization: This study does not present experimental findings.
Blinding: We did not need blinding since we had no randomized control trials.

Reporting for specific materials, systems and methods

Materials & experimental systems (n/a | Involved in the study)
- Unique biological materials
- Antibodies
- Eukaryotic cell lines
- Palaeontology
- Animals and other organisms
- Human research participants

Methods (n/a | Involved in the study)
- ChIP-seq
- Flow cytometry
- MRI-based neuroimaging
